
    Automate Amazon EKS troubleshooting using an Amazon Bedrock agentic workflow

    April 16, 2025

    As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face increasing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

    At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents, each of which interfaces with a specific observability signal or with the CI/CD workflow, to orchestrate and perform tasks based on the user prompt.

    In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents—deriving insights from K8sGPT and performing actions through the ArgoCD framework—you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.

    Solution overview

    The architecture consists of the following core components:

    • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
    • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT’s Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
    • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

    The following diagram illustrates the solution architecture.

    Architecture Diagram

    Prerequisites

    You need to have the following prerequisites in place:

    • The AWS Command Line Interface (AWS CLI) version 2. For installation instructions, refer to Installing or updating to the latest version of the AWS CLI.
    • An EKS cluster.
    • Helm.
    • kubectl.
    • Amazon Bedrock model access in the AWS Region of deployment (in this post, we used Anthropic Claude 3.5 Sonnet v1).
    • Download the accompanying AWS CloudFormation template. The template is dependent on downloading resources from an Amazon Simple Storage Service (Amazon S3) bucket provisioned in the US East (N. Virginia) us-east-1 AWS Region. Hence, it’s restricted to running in the us-east-1 Region only.

    Set up the Amazon EKS cluster with K8sGPT and ArgoCD

    We start with installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.

    The K8sGPT operator will help with enabling AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

    ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

    The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This powerful integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.

    1. Create the necessary namespaces in Amazon EKS:
      kubectl create ns helm-guestbook
      kubectl create ns k8sgpt-operator-system
    2. Add the k8sgpt Helm repository and install the operator:
      helm repo add k8sgpt https://charts.k8sgpt.ai/
      helm repo update
      helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
        --namespace k8sgpt-operator-system
    3. You can verify the installation by entering the following command:
      kubectl get pods -n k8sgpt-operator-system
      
      NAME                                                          READY   STATUS    RESTARTS  AGE
      release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d
      

    After the operator is deployed, you can configure a K8sGPT resource. This custom resource, defined by the K8sGPT Custom Resource Definition (CRD), holds the large language model (LLM) configuration used for AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various AI backends. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.

    1. Create an EKS Pod Identity association so that the k8sgpt service account in the cluster can access Amazon Bedrock:
      eksctl create podidentityassociation \
        --cluster PetSite \
        --namespace k8sgpt-operator-system \
        --service-account-name k8sgpt \
        --role-name k8sgpt-app-eks-pod-identity-role \
        --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess \
        --region $AWS_REGION
    2. Configure the K8sGPT CRD:
      cat << EOF > k8sgpt.yaml
      apiVersion: core.k8sgpt.ai/v1alpha1
      kind: K8sGPT
      metadata:
        name: k8sgpt-bedrock
        namespace: k8sgpt-operator-system
      spec:
        ai:
          enabled: true
          model: anthropic.claude-v3
          backend: amazonbedrock
          region: us-east-1
          credentials:
            secretRef:
              name: k8sgpt-secret
              namespace: k8sgpt-operator-system
        noCache: false
        repository: ghcr.io/k8sgpt-ai/k8sgpt
        version: v0.3.48
      EOF
      
      kubectl apply -f k8sgpt.yaml
      
    3. Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:
      kubectl get pods -n k8sgpt-operator-system
      NAME                                                          READY   STATUS    RESTARTS      AGE
      k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
      release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
      
    4. Now you can configure the ArgoCD controller:
      helm repo add argo https://argoproj.github.io/argo-helm
      helm repo update
      kubectl create namespace argocd
      helm install argocd argo/argo-cd \
        --namespace argocd \
        --create-namespace
    5. Verify the ArgoCD installation:
      kubectl get pods -n argocd
      NAME                                                READY   STATUS    RESTARTS   AGE
      argocd-application-controller-0                     1/1     Running   0          43d
      argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
      argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
      argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
      argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
      argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
      argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
    6. Patch the argocd service to have an external load balancer:
      kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
    7. You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:
      kubectl get svc argocd-server -n argocd
      NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
      argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
    8. Retrieve the credentials for the ArgoCD UI:
      export argocdpassword=$(kubectl -n argocd get secret argocd-initial-admin-secret \
        -o jsonpath="{.data.password}" | base64 -d)
      
      echo "ArgoCD admin password - $argocdpassword"
    9. Push the credentials to AWS Secrets Manager:
      aws secretsmanager create-secret \
        --name argocdcreds \
        --description "Credentials for argocd" \
        --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
    10. Configure a sample application in ArgoCD:
      cat << EOF > argocd-application.yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: helm-guestbook
        namespace: argocd
      spec:
        project: default
        source:
          repoURL: https://github.com/awsvikram/argocd-example-apps
          targetRevision: HEAD
          path: helm-guestbook
        destination:
          server: https://kubernetes.default.svc
          namespace: helm-guestbook
        syncPolicy:
          automated:
            prune: true
            selfHeal: true
      EOF
    11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
      kubectl apply -f argocd-application.yaml

      ArgoCD Application

    12. It takes some time for K8sGPT to analyze the newly created pods. To make that immediate, restart the pods created in the k8sgpt-operator-system namespace. The pods can be restarted by entering the following command:
      kubectl -n k8sgpt-operator-system rollout restart deploy
      
      deployment.apps/k8sgpt-bedrock restarted
      deployment.apps/k8sgpt-operator-controller-manager restarted
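
After the restart, the K8sGPT operator records its findings as Result custom resources in the cluster. The following is a minimal sketch of how those findings could be condensed into one line each. It assumes that the JSON comes from a command such as kubectl get results -n k8sgpt-operator-system -o json, and that each Result exposes spec.kind, spec.name, and a list of spec.error entries with a text field. The function name summarize_results is hypothetical, and the field layout should be checked against the K8sGPT operator version you run.

```python
import json


def summarize_results(results_json: str) -> list[str]:
    """Return one summary line per K8sGPT Result found in the JSON document.

    The expected input is the JSON list that kubectl prints for the Result
    custom resources (assumed layout: items[].spec.kind/name/error[].text).
    """
    document = json.loads(results_json)
    lines = []
    for item in document.get("items", []):
        spec = item.get("spec", {})
        kind = spec.get("kind", "Unknown")
        # Fall back to the resource's own name if spec.name is absent.
        name = spec.get("name") or item.get("metadata", {}).get("name", "unknown")
        # "error" may be missing or null when K8sGPT found nothing to report.
        errors = [entry.get("text", "") for entry in (spec.get("error") or [])]
        lines.append(f"{kind} {name}: " + "; ".join(errors))
    return lines
```

You could save the kubectl output to a file, read it in Python, and pass the text to summarize_results to get a quick overview before asking the agents for a root cause.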

    Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

    We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy several resources (costs will be incurred for the AWS resources used).

    Use the following parameters for the CloudFormation template:

    • EnvironmentName: The name for the deployment (EKSBlogSetup)
    • ArgoCD_LoadBalancer_URL: The ArgoCD load balancer URL, which you can retrieve with the following command:
      kubectl get service argocd-server -n argocd -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
    • AWSSecretName: The Secrets Manager secret name that was created to store ArgoCD credentials
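
To illustrate why the secret name is passed as a parameter, here is a hedged sketch of how a Lambda function might read the ArgoCD credentials back from Secrets Manager. It assumes the secret is a JSON document with USERNAME and PASSWORD keys, as created earlier in this post. The function name load_argocd_credentials and the injectable client argument are hypothetical. The functions deployed by the CloudFormation template may be implemented differently.

```python
import json


def load_argocd_credentials(secret_name="argocdcreds", client=None):
    """Return (username, password) stored in the given Secrets Manager secret."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    # The secret string is the JSON document written by create-secret.
    secret = json.loads(response["SecretString"])
    return secret["USERNAME"], secret["PASSWORD"]
```

Passing the client as an argument keeps the function easy to test and lets the caller control the Region and credentials.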

    The stack creates the following AWS Lambda functions:

    • <Stack name>-LambdaK8sGPTAgent-<auto-generated>
    • <Stack name>-RestartRollBackApplicationArgoCD-<auto-generated>
    • <Stack name>-ArgocdIncreaseMemory-<auto-generated>
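
Each of these Lambda functions backs an action group of an Amazon Bedrock agent. The following is a minimal sketch of what such a handler can look like, assuming the action group is defined with function details (rather than an OpenAPI schema), so that the event carries actionGroup, function, and a parameters list. The parameter name application_name is hypothetical, and the sketch only echoes the request instead of calling K8sGPT or ArgoCD.

```python
def lambda_handler(event, context):
    """Handle one action group invocation from an Amazon Bedrock agent."""
    # Parameters arrive as a list of {"name", "type", "value"} dictionaries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    application = params.get("application_name", "unknown")  # hypothetical parameter

    body = (
        f"Received function '{event['function']}' for application "
        f"'{application}' in action group '{event['actionGroup']}'."
    )

    # The agent expects the action group and function names echoed back,
    # with the result text under functionResponse.responseBody.TEXT.body.
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }
```

A real handler would replace the echo with the actual operation, for example a call to the ArgoCD API, and return its outcome in the same response structure.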

    The stack creates the following Amazon Bedrock agents:

    • ArgoCDAgent, with the following action groups:
      1. argocd-rollback
      2. argocd-restart
      3. argocd-memory-management
    • K8sGPTAgent, with the following action group:
      1. k8s-cluster-operations
    • CollaboratorAgent, with the following agents associated with it:
      1. ArgoCDAgent
      2. K8sGPTAgent

    The stack outputs the following:

    • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN will be needed at a later stage of the configuration process.
    • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
    • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock Agent alias
    • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias
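
If you prefer to read these outputs programmatically, the following sketch collects them into a dictionary. It assumes the stack is deployed in us-east-1, as described in this post. The function name stack_outputs and the injectable client argument are hypothetical.

```python
def stack_outputs(stack_name, client=None):
    """Return the CloudFormation stack outputs as an {OutputKey: OutputValue} dict."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        client = boto3.client("cloudformation", region_name="us-east-1")
    stacks = client.describe_stacks(StackName=stack_name)["Stacks"]
    # A stack without outputs has no "Outputs" key, so default to an empty list.
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}
```

For example, stack_outputs("EKS-Troubleshooter")["LambdaK8sGPTAgentRole"] would return the role ARN used in the next section, assuming that is the name you gave the stack.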

    Assign appropriate permissions to enable K8sGPT Amazon Bedrock agent to access the EKS cluster

    To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

    1. Create an access entry for the Lambda function’s execution role
      export CFN_STACK_NAME=EKS-Troubleshooter
      export EKS_CLUSTER=PetSite
      
      export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
      
      aws eks create-access-entry \
          --cluster-name $EKS_CLUSTER \
          --principal-arn $K8SGPT_LAMBDA_ROLE
    2. Associate the EKS view policy with the access entry
      aws eks associate-access-policy \
          --cluster-name $EKS_CLUSTER \
          --principal-arn $K8SGPT_LAMBDA_ROLE \
          --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
          --access-scope type=cluster
    3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.

    Bedrock agents

    Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

    Now, test the solution. We explore the following two scenarios:

    1. The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
    2. The collaborator agent coordinates with the ArgoCD agent to provide a response

    Collaborator agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

    In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

    The agent not only stated the root cause, but also went one step further and suggested a potential fix, which in this case is increasing the memory allocated to the application.

    K8sgpt agent finding
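
The same prompt can also be sent outside the console. The following is a minimal sketch that uses the invoke_agent operation of the Amazon Bedrock agent runtime and concatenates the streamed answer. The agent ID and alias ID are placeholders you would look up yourself (the alias ID is among the stack outputs). The function name ask_agent and the injectable client argument are hypothetical.

```python
import uuid


def ask_agent(agent_id, agent_alias_id, prompt, client=None, session_id=None):
    """Send a prompt to an Amazon Bedrock agent and return the full text answer."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reuse a session ID across calls to keep conversational context.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=prompt,
    )
    parts = []
    # The answer arrives as an event stream; only "chunk" events carry text.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)
```

Passing the same session_id for a follow-up prompt, such as a request to increase memory, lets the collaborator agent continue from the earlier root cause analysis.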

    Collaborator agent coordinates with ArgoCD agent to provide a response

    For this scenario, we continue from the previous prompt. We feel the application wasn’t provided enough memory, and it should be increased to permanently fix the issue. We can also tell the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

    ArgoUI
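
The health state shown in the UI is also available from the ArgoCD REST API. The following sketch logs in with the admin credentials and reads the health and sync status of an application. It assumes the API is reachable at the load balancer URL and that its TLS certificate is trusted by the client, which is not the case for a default self-signed installation. The function names and the injectable http argument are hypothetical.

```python
import json
import urllib.request


def _http_json(method, url, body=None, headers=None):
    """Send a JSON request with urllib and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def application_status(base_url, username, password, app_name, http=_http_json):
    """Return the health and sync status that ArgoCD reports for an application."""
    # Exchange the credentials for a bearer token.
    session = http(
        "POST",
        f"{base_url}/api/v1/session",
        {"username": username, "password": password},
    )
    auth = {"Authorization": f"Bearer {session['token']}"}
    # Fetch the application and pick out the two status fields.
    app = http("GET", f"{base_url}/api/v1/applications/{app_name}", None, auth)
    status = app.get("status", {})
    return {
        "health": status.get("health", {}).get("status"),
        "sync": status.get("sync", {}).get("status"),
    }
```

A health value other than Healthy indicates that the application needs attention.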

    Let’s now proceed to increase the memory, as shown in the following screenshot.

    Interacting with agent to increase memory

    The collaborator agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. You can confirm the change in the ArgoCD UI.

    ArgoUI showing memory increase
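
As an illustration of the arithmetic behind such a memory increase, the following sketch scales a Kubernetes memory quantity by a factor while keeping its unit. It is a hypothetical helper, not the logic of the deployed Lambda function, and it only handles whole numbers with the Ki, Mi, and Gi suffixes.

```python
import math
import re


def scale_memory(quantity: str, factor: float = 2.0) -> str:
    """Scale a memory quantity such as '128Mi' by factor, rounding up."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if not match:
        raise ValueError(f"Unsupported memory quantity: {quantity!r}")
    value, unit = int(match.group(1)), match.group(2)
    # Round up so the new limit is never lower than the requested increase.
    return f"{math.ceil(value * factor)}{unit}"
```

For example, scale_memory("128Mi") returns "256Mi", which could then be written to the application's manifest in Git for ArgoCD to synchronize.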

    Cleanup

    If you decide to stop using the solution, complete the following steps:

    1. To delete the associated resources deployed using AWS CloudFormation:
      1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
      2. Locate the stack you created during the deployment process (you assigned a name to it).
      3. Select the stack and choose Delete.
    2. Delete the EKS cluster if you created one specifically for this implementation.

    Conclusion

    By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation shows what becomes possible when you combine specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, human oversight remains valuable, particularly for complex scenarios and strategic decisions.

    As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.

    To learn more about Amazon Bedrock, refer to the following resources:

    • GitHub repo: Amazon Bedrock Workshop
    • Amazon Bedrock User Guide
    • Workshop: GenAI for AWS Cloud Operations
    • Workshop: Using generative AI on AWS for diverse content types
    • Getting insight from Amazon Managed Service for Prometheus using natural language powered by Amazon Bedrock

    About the authors

    Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of Generative AI, Vikram has been actively working with customers to leverage AWS’s AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.

    Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

    Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in assisting customers with enhancing their monitoring and observability capabilities within AWS offerings.

    Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.
