
    Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

    July 26, 2024

    Implementing hardware resiliency in your training infrastructure is crucial to mitigating risks and enabling uninterrupted model training. By implementing features such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.

In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect rare occurrences of issues when Neuron devices fail by tailing monitoring logs. It marks the worker nodes with a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By accelerating the speed of issue detection and remediation, it increases the reliability of your ML training and reduces the wasted time and cost due to hardware failures.

    This solution is applicable if you’re using managed nodes or self-managed node groups (which use Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.

    Solution overview

    The solution is based on the node problem detector and recovery DaemonSet, a powerful tool designed to automatically detect and report various node-level problems in a Kubernetes cluster.

    The node problem detector component continuously monitors the kernel message (kmsg) logs on the worker nodes. If it detects error messages specifically related to the Neuron device (the Trainium or AWS Inferentia chip), it sets the NeuronHealth node condition on the Kubernetes API server with a NeuronHasError reason.
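
    Conceptually, the detection step amounts to watching the kernel ring buffer for Neuron hardware error markers. The following manual check is an illustration only (run on a worker node; the DaemonSet automates this through its log-monitor configuration):

    # Illustration only: watch kmsg on a worker node for Neuron hardware error markers
    sudo dmesg --follow | grep NEURON_HW_ERR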

    The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating an issue with the Neuron device, it takes automated actions: it marks the affected instance in the relevant Auto Scaling group as unhealthy, which prompts the Auto Scaling group to stop the instance and launch a replacement. The node recovery agent also publishes Amazon CloudWatch metrics so users can monitor and alert on these events.
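
    For illustration, the agent's two actions correspond to AWS API calls like the following (a minimal AWS CLI sketch with a placeholder instance ID; the DaemonSet makes these calls itself using its IAM permissions):

    # Mark the instance unhealthy so its Auto Scaling group stops and replaces it
    aws autoscaling set-instance-health \
      --instance-id i-0123456789abcdef0 \
      --health-status Unhealthy

    # Publish a metric in the NeuronHealthCheck namespace for monitoring and alerting
    aws cloudwatch put-metric-data \
      --namespace NeuronHealthCheck \
      --metric-name NeuronHasError_DMA_ERROR \
      --value 1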

    The following diagram illustrates the solution architecture and workflow.

    In the following walkthrough, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for the node problem detector, and inject an error message into the node. We then observe the failing node being stopped and replaced with a new one, and find a metric in CloudWatch indicating the error.

    Prerequisites

    Before you start, make sure you have installed the following tools on your machine (a quick way to verify them is shown after the list):

    The latest version of the AWS Command Line Interface (AWS CLI)
    eksctl
    kubectl
    Terraform
    The Session Manager plugin
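
    As a sanity check, each tool can report its version (assuming they are on your PATH):

    aws --version
    eksctl version
    kubectl version --client
    terraform version
    session-manager-plugin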

    Deploy the node problem detection and recovery plugin

    Complete the following steps to configure the node problem detection and recovery plugin:

    Create an EKS cluster using the Data on EKS Terraform module:

    git clone https://github.com/awslabs/data-on-eks.git

    export TF_VAR_region=us-east-2
    export TF_VAR_trn1_32xl_desired_size=4
    export TF_VAR_trn1_32xl_min_size=4
    cd data-on-eks/ai-ml/trainium-inferentia/ && chmod +x install.sh
    ./install.sh

    aws eks --region us-east-2 describe-cluster --name trainium-inferentia

    # Creates k8s config file to authenticate with EKS
    aws eks --region us-east-2 update-kubeconfig --name trainium-inferentia

    kubectl get nodes
    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-100-64-161-213.us-east-2.compute.internal   Ready    <none>   31d   v1.29.0-eks-5e0fdde
    ip-100-64-227-31.us-east-2.compute.internal    Ready    <none>   31d   v1.29.0-eks-5e0fdde
    ip-100-64-70-179.us-east-2.compute.internal    Ready    <none>   31d   v1.29.0-eks-5e0fdde

    Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.
    Create a policy as shown below. Update the Resource key value to match your node group ARN that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.

    You can get these values from the Amazon EKS console. Choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.
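
    If you prefer the CLI, you can also look up the Auto Scaling group names and ARNs directly (a hedged example; filter the output for the group backing your Trainium or AWS Inferentia node group):

    aws autoscaling describe-auto-scaling-groups \
      --region us-east-2 \
      --query 'AutoScalingGroups[].[AutoScalingGroupName,AutoScalingGroupARN]' \
      --output table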

    # Create npd-policy-trimmed.json
    cat << EOF > npd-policy-trimmed.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "autoscaling:SetInstanceHealth",
            "autoscaling:DescribeAutoScalingInstances"
          ],
          "Effect": "Allow",
          "Resource": "<arn of the Auto Scaling group corresponding to the Neuron nodes for the cluster>"
        },
        {
          "Action": [
            "ec2:DescribeInstances"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Condition": {
            "ForAllValues:StringEquals": {
              "ec2:ResourceTag/aws:autoscaling:groupName": "<name of the Auto Scaling group corresponding to the Neuron nodes for the cluster>"
            }
          }
        },
        {
          "Action": [
            "cloudwatch:PutMetricData"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "cloudwatch:Namespace": "NeuronHealthCheck"
            }
          }
        }
      ]
    }
    EOF
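
    Before creating the IAM policy, it can be worth confirming that the file (after you substitute your Auto Scaling group values) is well-formed JSON; any JSON validator works, for example:

    python3 -m json.tool npd-policy-trimmed.json > /dev/null && echo "npd-policy-trimmed.json is valid JSON"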

    This component will be installed as a DaemonSet in your EKS cluster.

    # To create the policy, the AWS CLI can be used as shown below, where npd-policy-trimmed.json is the policy JSON constructed from the template above.

    aws iam create-policy \
      --policy-name NeuronProblemDetectorPolicy \
      --policy-document file://npd-policy-trimmed.json

    # Note the ARN

    CLUSTER_NAME=trainium-inferentia # Your EKS Cluster Name
    AWS_REGION=us-east-2
    ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
    POLICY_ARN=arn:aws:iam::$ACCOUNT_ID:policy/NeuronProblemDetectorPolicy

    eksctl create addon --cluster $CLUSTER_NAME --name eks-pod-identity-agent \
      --region $AWS_REGION

    eksctl create podidentityassociation \
      --cluster $CLUSTER_NAME \
      --namespace neuron-healthcheck-system \
      --service-account-name node-problem-detector \
      --permission-policy-arns="$POLICY_ARN" \
      --region $AWS_REGION
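
    Optionally, confirm the association exists before installing the plugin (output formatting may vary with your eksctl version):

    eksctl get podidentityassociation --cluster $CLUSTER_NAME --region $AWS_REGION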

    # Install the Neuron NPD and recovery plugin

    kubectl create ns neuron-healthcheck-system
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml | kubectl apply -f -
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml | kubectl apply -f -
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml | kubectl apply -f -

    # Expected result (with 4 Neuron nodes in cluster):

    kubectl get pod -n neuron-healthcheck-system
    NAME                          READY   STATUS    RESTARTS   AGE
    node-problem-detector-49p6w   2/2     Running   0          31s
    node-problem-detector-j7wct   2/2     Running   0          31s
    node-problem-detector-qr6jm   2/2     Running   0          31s
    node-problem-detector-vwq8x   2/2     Running   0          31s
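
    To confirm the detector is running and watching kmsg, you can tail the DaemonSet logs (an optional check; container names come from the manifest above and may differ between versions):

    kubectl logs -n neuron-healthcheck-system ds/node-problem-detector --all-containers --tail=20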

    The container images referenced in the Kubernetes manifests are hosted in public repositories such as registry.k8s.io and public.ecr.aws. For production environments, it's recommended to limit such external dependencies by hosting the container images in a private registry and syncing them from the public repositories. For a detailed implementation, refer to the blog post Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry.

    By default, the node problem detector does not take any action on a failed node. If you would like the EC2 instance to be terminated automatically by the agent, update the DaemonSet as follows:

    kubectl edit -n neuron-healthcheck-system ds/node-problem-detector

    ...
        env:
        - name: ENABLE_RECOVERY
          value: "true"
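
    Alternatively, a non-interactive equivalent is sketched below; note that kubectl set env applies the variable to every container in the DaemonSet unless you scope it with --containers:

    kubectl set env -n neuron-healthcheck-system ds/node-problem-detector ENABLE_RECOVERY=true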

    Test the node problem detector and recovery solution

    After the plugin is installed, you can see Neuron conditions show up by running kubectl describe node. We simulate a device error by injecting error logs in the instance:

    # Verify node conditions on any node. Neuron conditions should show up.

    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep Conditions: -A7

    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      NeuronHealth     False   Fri, 29 Mar 2024 15:52:08 +0800   Thu, 28 Mar 2024 13:59:19 +0800   NeuronHasNoError             Neuron has no error
      MemoryPressure   False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure      False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready            True    Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:59:08 +0800   KubeletReady                 kubelet is posting ready status
    # To get the provider ID
    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep -i provider | sed -E 's/.*\/([^\/]+)$/\1/'

    i-0381404aa69eae3f6

    # Connect to the worker node and simulate a hardware error on the Neuron device
    aws ssm start-session --target i-0381404aa69eae3f6 --region us-east-2

    Starting session with SessionId: lindarr-0069460593240662a

    sh-4.2$
    sh-4.2$ sudo bash
    [root@ip-192-168-93-211 bin]# echo "test NEURON_HW_ERR=DMA_ERROR test" >> /dev/kmsg

    Around 2 minutes later, you can see that the error has been identified:

    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7

    Conditions:
      Type           Status   LastHeartbeatTime                 LastTransitionTime                Reason                     Message
      ----           ------   -----------------                 ------------------                ------                     -------
      NeuronHealth   True     Fri, 29 Mar 2024 17:42:43 +0800   Fri, 29 Mar 2024 17:42:38 +0800   NeuronHasError_DMA_ERROR   test NEURON_HW_ERR=DMA_ERROR test

    ...

    Events:
      Type      Reason                     Age   From             Message
      ----      ------                     ----  ----             -------
      Warning   NeuronHasError_DMA_ERROR   36s   kernel-monitor   Node condition NeuronHealth is now: True, reason: NeuronHasError_DMA_ERROR, message: "test NEURON_HW_ERR=DMA_ERROR test"

    Now that the error has been detected by the node problem detector and the recovery agent has automatically set the node as unhealthy, Amazon EKS cordons the node and evicts the pods running on it:

    # Verify the Node scheduling is disabled.
    kubectl get node
    NAME                                           STATUS                        ROLES    AGE    VERSION
    ip-100-64-1-48.us-east-2.compute.internal      Ready                         <none>   156m   v1.29.0-eks-5e0fdde
    ip-100-64-103-26.us-east-2.compute.internal    Ready                         <none>   94s    v1.29.0-eks-5e0fdde
    ip-100-64-239-245.us-east-2.compute.internal   Ready                         <none>   154m   v1.29.0-eks-5e0fdde
    ip-100-64-52-40.us-east-2.compute.internal     Ready                         <none>   156m   v1.29.0-eks-5e0fdde
    ip-100-64-58-151.us-east-2.compute.internal    NotReady,SchedulingDisabled   <none>   27h    v1.29.0-eks-5e0fdde
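
    If you want to watch the replacement from the Auto Scaling side, a command along these lines can help (the group name is a placeholder; use the one backing your Neuron node group):

    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name <name of the Auto Scaling group corresponding to the Neuron nodes> \
      --region us-east-2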

    You can open the CloudWatch console and verify the metrics in the NeuronHealthCheck namespace; the NeuronHasError_DMA_ERROR metric has the value 1.
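
    The same check can be made from the CLI, for example by listing the metrics published in the namespace (dimension names depend on the plugin version):

    aws cloudwatch list-metrics --namespace NeuronHealthCheck --region us-east-2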

    After replacement, you can see a new worker node has been created:

    # The node with AGE 28s is the newly created replacement node

    kubectl get node
    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-192-168-65-77.us-east-2.compute.internal    Ready    <none>   28s   v1.29.0-eks-5e0fdde
    ip-192-168-81-176.us-east-2.compute.internal   Ready    <none>   9d    v1.29.0-eks-5e0fdde
    ip-192-168-91-218.us-east-2.compute.internal   Ready    <none>   9d    v1.29.0-eks-5e0fdde
    ip-192-168-94-83.us-east-2.compute.internal    Ready    <none>   9d    v1.29.0-eks-5e0fdde

    Let's look at a real-world scenario in which you're running a distributed training job using an MPI operator, as outlined in Llama-2 on Trainium, and there is an irrecoverable Neuron error on one of the nodes. Before the plugin is deployed, such a training job becomes stuck, resulting in wasted time and computational cost. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. The training scripts save checkpoints periodically, so training resumes from the previous checkpoint.

    The following screenshot shows example logs from a distributed training job.

    The training has been started. (You can ignore loss=nan for now; it’s a known issue and will be removed. For immediate use, refer to the reduced_train_loss metric.)

    The following screenshot shows the checkpoint created at step 77.

    Training stopped after one of the nodes encountered a problem at step 86. (The error was injected manually for testing.)

    After the faulty node was detected and replaced by the Neuron node problem detector and recovery plugin, the training process resumed from step 77, the last checkpoint.

    Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues that prevent the launch of replacement nodes. In such cases, training jobs stall and require manual intervention. However, the stopped instance does not incur further EC2 charges.

    If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms on the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query like SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to aggregate these values when evaluating the alarms. The following screenshots show an example.
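
    As a rough sketch of such an alarm created from the CLI (the alarm name, threshold, and SNS topic are placeholders, and the MetricDataQuery fields may need adjusting for your account):

    aws cloudwatch put-metric-alarm \
      --alarm-name NeuronHasError-DMA-ERROR \
      --metrics '[{"Id":"q1","Expression":"SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck","Period":60,"ReturnData":true}]' \
      --comparison-operator GreaterThanThreshold \
      --threshold 0 \
      --evaluation-periods 1 \
      --alarm-actions arn:aws:sns:us-east-2:111122223333:your-sns-topic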

    Clean up

    To clean up all the provisioned resources for this post, run the cleanup script:

    # neuron-problem-detector-role-$CLUSTER_NAME
    eksctl delete podidentityassociation \
      --service-account-name node-problem-detector \
      --namespace neuron-healthcheck-system \
      --cluster $CLUSTER_NAME \
      --region $AWS_REGION

    # Delete the EKS cluster
    cd data-on-eks/ai-ml/trainium-inferentia
    ./cleanup.sh

    Conclusion

    In this post, we showed how the Neuron node problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by AWS Trainium and AWS Inferentia. If you're running Neuron-based EC2 instances with managed node groups or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster and benefit from improved reliability and fault tolerance of your machine learning training workloads in the event of node failure.

    About the authors

    Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

    Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.

    Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.

    Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, Container, Observability, and Open Source Technologies. In his spare time, he likes to work out and have fun with his family.
