
    AWS Inferentia and AWS Trainium deliver lowest cost to deploy Llama 3 models in Amazon SageMaker JumpStart

    May 2, 2024

    Today, we’re excited to announce the availability of Meta Llama 3 inference on AWS Trainium and AWS Inferentia based instances in Amazon SageMaker JumpStart. The Meta Llama 3 models are a collection of pre-trained and fine-tuned generative text models. Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances, powered by AWS Trainium and AWS Inferentia2, provide the most cost-effective way to deploy Llama 3 models on AWS. They offer up to 50% lower cost to deploy than comparable Amazon EC2 instances. They not only reduce the time and expense involved in training and deploying large language models (LLMs), but also provide developers with easier access to high-performance accelerators to meet the scalability and efficiency needs of real-time applications, such as chatbots and AI assistants.

    In this post, we demonstrate how easy it is to deploy Llama 3 on AWS Trainium and AWS Inferentia based instances in SageMaker JumpStart.

    Meta Llama 3 model on SageMaker Studio

    SageMaker JumpStart provides access to publicly available and proprietary foundation models (FMs). Foundation models are onboarded and maintained by third-party and proprietary providers, and as such are released under different licenses as designated by the model source. Be sure to review the license for any FM that you use; you are responsible for reviewing and complying with applicable license terms and making sure they are acceptable for your use case before downloading or using the content.

    You can access the Meta Llama 3 FMs through SageMaker JumpStart on the Amazon SageMaker Studio console and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

    SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all machine learning (ML) development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Get Started with SageMaker Studio.

    On the SageMaker Studio console, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane. If you’re using SageMaker Studio Classic, refer to Open and use JumpStart in Studio Classic to navigate to the SageMaker JumpStart models.

    From the SageMaker JumpStart landing page, you can search for “Meta” in the search box.

    Choose the Meta model card to list all the models from Meta on SageMaker JumpStart.

    You can also find relevant model variants by searching for “neuron.” If you don’t see Meta Llama 3 models, update your SageMaker Studio version by shutting down and restarting SageMaker Studio.
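
    If you prefer to discover the models programmatically instead of through the Studio UI, the SageMaker Python SDK can list the JumpStart catalog. The following is a minimal sketch, assuming the list_jumpstart_models utility is available in your installed version of the SDK:

    from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

    # List all JumpStart model IDs, then keep the Neuron-compiled Llama 3 variants
    all_model_ids = list_jumpstart_models()
    neuron_llama3 = [m for m in all_model_ids if "llama-3" in m and "neuron" in m]
    print(neuron_llama3)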

    No-code deployment of the Llama 3 Neuron model on SageMaker JumpStart

    You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You can also find two buttons, Deploy and Preview notebooks, which help you deploy the model.

    When you choose Deploy, the page shown in the following screenshot appears. The top section of the page shows the end-user license agreement (EULA) and acceptable use policy for you to acknowledge.

    After you acknowledge the policies, provide your endpoint settings and choose Deploy to deploy the endpoint of the model.

    Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

    Meta Llama 3 deployment on AWS Trainium and AWS Inferentia using the SageMaker JumpStart SDK

    In SageMaker JumpStart, we have pre-compiled the Meta Llama 3 model for a variety of configurations to avoid runtime compilation during deployment and fine-tuning. The Neuron Compiler FAQ has more details about the compilation process.

    There are two ways to deploy Meta Llama 3 on AWS Inferentia and Trainium based instances using the SageMaker JumpStart SDK: you can deploy the model with two lines of code for simplicity, or take more control of the deployment configurations. The following code snippet shows the simpler mode of deployment:

    from sagemaker.jumpstart.model import JumpStartModel

    model_id = "meta-textgenerationneuron-llama-3-8b"
    accept_eula = True  # set to True only after reviewing and accepting the EULA
    model = JumpStartModel(model_id=model_id)
    predictor = model.deploy(accept_eula=accept_eula)

    To perform inference on these models, you need to specify the argument accept_eula as True as part of the model.deploy() call. This means you have read and accepted the EULA of the model. The EULA can be found in the model card description or from https://ai.meta.com/resources/models-and-libraries/llama-downloads/.

    The default instance type for Meta Llama-3-8B is ml.inf2.24xlarge. The other supported model IDs for deployment are the following:

    meta-textgenerationneuron-llama-3-70b
    meta-textgenerationneuron-llama-3-8b-instruct
    meta-textgenerationneuron-llama-3-70b-instruct
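
    For example, deploying the instruct-tuned 8B variant only requires swapping in its model ID; this is a minimal sketch, and the default instance type applies unless you override it:

    from sagemaker.jumpstart.model import JumpStartModel

    # Deploy the instruct-tuned 8B Neuron variant; accept_eula=True confirms
    # that you have reviewed and accepted the model's EULA
    instruct_model = JumpStartModel(model_id="meta-textgenerationneuron-llama-3-8b-instruct")
    instruct_predictor = instruct_model.deploy(accept_eula=True)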

    SageMaker JumpStart has pre-selected configurations that can help get you started, which are listed in the following tables. For more information about optimizing these configurations further, refer to advanced deployment configurations.

    Llama-3 8B and Llama-3 8B Instruct

    Instance type               OPTION_N_POSITIONS   OPTION_MAX_ROLLING_BATCH_SIZE   OPTION_TENSOR_PARALLEL_DEGREE   OPTION_DTYPE
    ml.inf2.8xlarge             8192                 1                               2                               bf16
    ml.inf2.24xlarge (Default)  8192                 1                               12                              bf16
    ml.inf2.24xlarge            8192                 12                              12                              bf16
    ml.inf2.48xlarge            8192                 1                               24                              bf16
    ml.inf2.48xlarge            8192                 12                              24                              bf16

    Llama-3 70B and Llama-3 70B Instruct

    Instance type               OPTION_N_POSITIONS   OPTION_MAX_ROLLING_BATCH_SIZE   OPTION_TENSOR_PARALLEL_DEGREE   OPTION_DTYPE
    ml.trn1.32xlarge            8192                 1                               32                              bf16
    ml.trn1.32xlarge (Default)  8192                 4                               32                              bf16

    The following code shows how you can customize deployment configurations such as sequence length, tensor parallel degree, and maximum rolling batch size:

    from sagemaker.jumpstart.model import JumpStartModel

    model_id = "meta-textgenerationneuron-llama-3-70b"
    model = JumpStartModel(
        model_id=model_id,
        env={
            "OPTION_DTYPE": "bf16",
            "OPTION_N_POSITIONS": "8192",
            "OPTION_TENSOR_PARALLEL_DEGREE": "32",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
        },
        instance_type="ml.trn1.32xlarge",
    )
    # Change accept_eula to True after reviewing the EULA to deploy
    pretrained_predictor = model.deploy(accept_eula=False)

    Now that you have deployed the Meta Llama 3 neuron model, you can run inference by invoking the endpoint:

    payload = {
        "inputs": "I believe the meaning of life is",
        "parameters": {
            "max_new_tokens": 64,
            "top_p": 0.9,
            "temperature": 0.6,
        },
    }

    response = pretrained_predictor.predict(payload)

    Output:

    I believe the meaning of life is
    > to be happy. I believe that happiness is a choice. I believe that happiness
    is a state of mind. I believe that happiness is a state of being. I believe that
    happiness is a state of being. I believe that happiness is a state of being. I
    believe that happiness is a state of being. I believe

    For more information on the parameters in the payload, refer to Detailed parameters.

    Refer to Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium for details on how to pass the parameters to control text generation.
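
    You can also invoke the deployed endpoint from any application with the AWS SDK, without holding the SageMaker predictor object. The following is a minimal sketch using boto3; the endpoint name is a placeholder, so substitute your own (available from predictor.endpoint_name):

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    payload = {
        "inputs": "I believe the meaning of life is",
        "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6},
    }

    # "my-llama3-endpoint" is a placeholder; use your actual endpoint name
    response = runtime.invoke_endpoint(
        EndpointName="my-llama3-endpoint",
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(response["Body"].read().decode("utf-8"))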

    Clean up

    When you no longer need the endpoint, delete the resources using the following code to avoid ongoing charges (substitute pretrained_predictor if you used the customized deployment):

    # Delete the deployed model
    predictor.delete_model()

    # Delete the model endpoint
    predictor.delete_endpoint()

    Conclusion

    The deployment of Meta Llama 3 models on AWS Inferentia and AWS Trainium using SageMaker JumpStart demonstrates the lowest cost for deploying large-scale generative AI models like Llama 3 on AWS. These models, including variants like Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B, and Meta-Llama-3-70B-Instruct, use AWS Neuron for inference on AWS Trainium and Inferentia. AWS Trainium and Inferentia offer up to 50% lower cost to deploy than comparable EC2 instances.

    In this post, we demonstrated how to deploy Meta Llama 3 models on AWS Trainium and AWS Inferentia using SageMaker JumpStart. The ability to deploy these models through the SageMaker JumpStart console and Python SDK offers flexibility and ease of use. We are excited to see how you use these models to build interesting generative AI applications.

    To start using SageMaker JumpStart, refer to Getting started with Amazon SageMaker JumpStart. For more examples of deploying models on AWS Trainium and AWS Inferentia, see the GitHub repo. For more information on deploying Meta Llama 3 models on GPU-based instances, see Meta Llama 3 models are now available in Amazon SageMaker JumpStart.

    About the Authors

    Xin Huang is a Senior Applied Scientist
    Rachna Chadha is a Principal Solutions Architect, AI/ML
    Qing Lan is a Senior SDE, ML System
    Pinak Panigrahi is a Senior Solutions Architect, Annapurna ML
    Christopher Whitten is a Software Development Engineer
    Kamran Khan is Head of BD/GTM, Annapurna ML
    Ashish Khetan is a Senior Applied Scientist
    Pradeep Cruz is a Senior SDM
