
    Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

    April 15, 2025

    Organizations are constantly seeking ways to harness the power of advanced large language models (LLMs) to enable a wide range of applications such as text generation, summarization, question answering, and many others. As these models grow more powerful and capable, deploying them in production environments while optimizing performance and cost-efficiency becomes more challenging.

    Amazon Web Services (AWS) provides highly optimized and cost-effective solutions for deploying AI models, like the Mixtral 8x7B language model, for inference at scale. AWS Inferentia and AWS Trainium are AWS AI chips, purpose-built to deliver high-throughput, low-latency inference and training performance for even the largest deep learning models. The Mixtral 8x7B model adopts the Mixture-of-Experts (MoE) architecture with eight experts. AWS Neuron—the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances—employs expert parallelism for the MoE architecture, sharding the eight experts across multiple NeuronCores.

    This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We’ll walk through model compilation using Hugging Face Optimum Neuron, which provides a set of tools enabling straightforward model loading, training, and inference, and the Text Generation Inference (TGI) container, which provides the toolkit for deploying and serving LLMs with Hugging Face. This will be followed by deployment to an Amazon SageMaker real-time inference endpoint, which automatically provisions and manages the Inferentia2 instances behind the scenes and provides a containerized environment to run the model securely and at scale.

    While pre-compiled model versions exist, we’ll cover the compilation process to illustrate important configuration options and instance sizing considerations. This end-to-end guide combines Amazon Elastic Compute Cloud (Amazon EC2)-based compilation with SageMaker deployment to help you use Mixtral 8x7B’s capabilities with optimal performance and cost efficiency.

    Step 1: Set up Hugging Face access

    Before you can deploy the Mixtral 8x7B model, there are a few prerequisites that you need to have in place.

    • The model is hosted on Hugging Face and uses their transformers library. To download and use the model, you need to authenticate with Hugging Face using a user access token. These tokens allow secure access for applications and notebooks to Hugging Face’s services. You first need to create a Hugging Face account if you don’t already have one, which you can then use to generate and manage your access tokens through the user settings.
    • The mistralai/Mixtral-8x7B-Instruct-v0.1 model that you will be working with in this post is a gated model. This means that you need to specifically request access from Hugging Face before you can download and work with the model.
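
    Once your access request has been approved, you can optionally confirm that your token can see the gated repository before moving on. The following is a minimal sketch using the huggingface_hub library (the token value is a placeholder); it only succeeds after gated access has been granted:

    from huggingface_hub import HfApi

    api = HfApi(token="hf_...")  # replace with your Hugging Face user access token

    # Raises a gated/unauthorized error until your access request is approved
    info = api.model_info("mistralai/Mixtral-8x7B-Instruct-v0.1")
    print(info.id)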

    Step 2: Launch an Inferentia2-powered EC2 Inf2 instance

    To get started with an Amazon EC2 Inf2 instance for deploying the Mixtral 8x7B, either deploy the AWS CloudFormation template or use the AWS Management Console.

    To launch an Inferentia2 instance using the console:

    1. Navigate to the Amazon EC2 console and choose Launch Instance.
    2. Enter a descriptive name for your instance.
    3. Under Application and OS Images, search for and select the Hugging Face Neuron Deep Learning AMI, which comes pre-configured with the Neuron software stack for AWS Inferentia.
    4. For Instance type, select inf2.24xlarge, which contains six Inferentia2 chips (12 NeuronCores).
    5. Create or select an existing key pair to enable SSH access.
    6. Create or select a security group that allows inbound SSH connections from the internet.
    7. Under Configure Storage, set the root EBS volume to 512 GiB to accommodate the large model size.
    8. After the settings are reviewed, choose Launch Instance.

    With your Inf2 instance launched, connect to it over SSH by first locating the public IP or DNS name in the Amazon EC2 console. Later in this post, you will connect to a Jupyter notebook using a browser on port 8888. To do that, SSH tunnel to the instance using the key pair you configured during instance creation.

    ssh -i "<pem file>" ubuntu@<instance DNS name> -L 8888:127.0.0.1:8888

    After signing in, list the NeuronCores attached to the instance and their associated topology:

    neuron-ls

    For inf2.24xlarge, you should see the following output listing six Neuron devices:

    instance-type: inf2.24xlarge
    instance-id: i-...
    +--------+--------+--------+-----------+---------+
    | NEURON | NEURON | NEURON | CONNECTED |   PCI   |
    | DEVICE | CORES  | MEMORY |  DEVICES  |   BDF   |
    +--------+--------+--------+-----------+---------+
    | 0      | 2      | 32 GB  | 1         | 10:1e.0 |
    | 1      | 2      | 32 GB  | 0, 2      | 20:1e.0 |
    | 2      | 2      | 32 GB  | 1, 3      | 10:1d.0 |
    | 3      | 2      | 32 GB  | 2, 4      | 20:1f.0 |
    | 4      | 2      | 32 GB  | 3, 5      | 10:1f.0 |
    | 5      | 2      | 32 GB  | 4         | 20:1d.0 |
    +--------+--------+--------+-----------+---------+

    For more information on the neuron-ls command, see the Neuron LS User Guide.

    Make sure the Inf2 instance is sized correctly to host the model. Each Inferentia NeuronCore processor contains 16 GB of high-bandwidth memory (HBM). To accommodate an LLM like the Mixtral 8x7B on AWS Inferentia2 (inf2) instances, a technique called tensor parallelism is used. This allows the model’s weights, activations, and computations to be split and distributed across multiple NeuronCores in parallel. To determine the degree of tensor parallelism required, you need to calculate the total memory footprint of the model. This can be computed as:

    total memory = bytes per parameter * number of parameters

    The Mixtral-8x7B model consists of 46.7 billion parameters. With weights cast to float16, you need 93.4 GB to store the model weights. The total space required is often greater than just the model parameters because of the caching of attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size. With a batch size of 1 and a sequence length of 1024 tokens, the total memory footprint for the caching is 0.5 GB. The exact formula can be found in the AWS Neuron documentation, and the hyperparameter configuration required for these calculations is stored in the model config.json file.

    Given that each NeuronCore has 16 GB of HBM, and the model requires approximately 94 GB of memory, a minimum tensor parallelism degree of 6 would theoretically suffice. However, with 32 attention heads, the tensor parallelism degree must be a divisor of this number.

    Furthermore, considering the model’s size and the MoE implementation in transformers-neuronx, the supported tensor parallelism degrees are limited to 8, 16, and 32. For the example in this post, you will distribute the model across eight NeuronCores.
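
    To make the sizing concrete, the following short calculation (a sketch using only the figures quoted above) shows the approximate per-core memory for each supported tensor parallelism degree, and why 8 is the smallest option that fits within the 16 GB of HBM per NeuronCore:

    # Approximate sizing for Mixtral-8x7B in float16 (illustrative figures from above)
    params = 46.7e9            # model parameters
    bytes_per_param = 2        # float16
    kv_cache_gb = 0.5          # batch size 1, sequence length 1024
    hbm_per_core_gb = 16       # HBM per NeuronCore

    total_gb = params * bytes_per_param / 1e9 + kv_cache_gb   # ~93.9 GB

    for tp_degree in (8, 16, 32):   # degrees supported by transformers-neuronx for this model
        per_core_gb = total_gb / tp_degree
        fits = "fits" if per_core_gb <= hbm_per_core_gb else "does not fit"
        print(f"tp={tp_degree}: ~{per_core_gb:.1f} GB per core ({fits})")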

    Compile Mixtral-8x7B model to AWS Inferentia2

    The Neuron SDK includes a specialized compiler that automatically optimizes the model format for efficient execution on AWS Inferentia2.

    1. To start this process, launch the neuronx-tgi container and pass the Inferentia devices to it. For more information about launching the container, see Deploy the Text Generation Inference (TGI) Container on a dedicated host.
    docker run -it --entrypoint /bin/bash \
      --net=host -v $(pwd):$(pwd) -w $(pwd) \
      --device=/dev/neuron0 \
      --device=/dev/neuron1 \
      --device=/dev/neuron2 \
      --device=/dev/neuron3 \
      --device=/dev/neuron4 \
      --device=/dev/neuron5 \
      ghcr.io/huggingface/neuronx-tgi:0.0.25
    2. Inside the container, sign in to the Hugging Face Hub to access gated models such as Mixtral-8x7B-Instruct-v0.1 (see Step 1: Set up Hugging Face access). Make sure to use a token with read and write permissions so you can later save the compiled model to the Hugging Face Hub.
    huggingface-cli login --token hf_...
    3. After signing in, compile the model with optimum-cli. This process downloads the model artifacts, compiles the model, and saves the results in the specified directory. Neuron chips are designed to execute models with fixed input shapes for optimal performance, so the compiled artifact shapes must be known at compilation time. In the following command, you set the batch size, input/output sequence length, data type, and tensor parallelism degree (number of NeuronCores). For more information about these parameters, see Export a model to Inferentia.

    Let’s discuss these parameters in more detail:

    • The parameter batch_size is the number of input sequences that the model will accept.
    • sequence_length specifies the maximum number of tokens in an input sequence. This affects memory usage and model performance during inference or training on Neuron hardware. A larger number will increase the model’s memory requirements because the attention mechanism needs to operate over the entire sequence, which leads to more computations and memory usage; while a smaller number will do the opposite. The value 1024 will be adequate for this example.
    • The auto_cast_type parameter controls type casting for model weights and computations during inference. The options are bf16, fp16, or tf32. For more information about defining which lower-precision data type the compiler should use, see Mixed Precision and Performance-accuracy Tuning. For models trained in float32, the 16-bit mixed-precision options (bf16, fp16) generally provide sufficient accuracy while significantly improving performance. We use float16 with the argument auto_cast_type fp16.
    • The num_cores parameter controls the number of cores on which the model should be deployed. This dictates the number of parallel shards or partitions the model is split into. Each shard is then executed on a separate NeuronCore, taking advantage of the 16 GB of high-bandwidth memory available per core. As discussed in the previous section, given the Mixtral-8x7B model’s requirements, Neuron supports tensor parallelism degrees of 8, 16, or 32. The inf2.24xlarge instance contains 12 NeuronCores, so to optimally distribute the model, we set num_cores to 8.
    optimum-cli export neuron \
      --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
      --batch_size 1 \
      --sequence_length 1024 \
      --auto_cast_type fp16 \
      --num_cores 8 \
      ./neuron_model_path
    4. Downloading and compiling the model should take 10–20 minutes. After the compilation completes successfully, you can check the artifacts created in the output directory:
    neuron_model_path
    ├── compiled
    │ ├── 2ea52780bf51a876a581.neff
    │ ├── 3fe4f2529b098b312b3d.neff
    │ ├── ...
    │ ├── ...
    │ ├── cfda3dc8284fff50864d.neff
    │ └── d6c11b23d8989af31d83.neff
    ├── config.json
    ├── generation_config.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    ├── tokenizer.model
    └── tokenizer_config.json
    5. Push the compiled model to the Hugging Face Hub with the following command. Make sure to change <user_id> to your Hugging Face username. If the model repository doesn’t exist, it will be created automatically. Alternatively, store the model on Amazon Simple Storage Service (Amazon S3).

    huggingface-cli upload <user_id>/Mixtral-8x7B-Instruct-v0.1 ./neuron_model_path ./
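
    Optionally, you can also smoke-test the compiled artifacts locally on the Inf2 instance before relying on them for deployment. The following is a minimal sketch using Optimum Neuron’s NeuronModelForCausalLM; it assumes you are still inside the neuronx-tgi container with ./neuron_model_path present:

    from optimum.neuron import NeuronModelForCausalLM
    from transformers import AutoTokenizer

    # Load the tokenizer and the precompiled Neuron artifacts from the output directory
    tokenizer = AutoTokenizer.from_pretrained("./neuron_model_path")
    model = NeuronModelForCausalLM.from_pretrained("./neuron_model_path")

    inputs = tokenizer("What is deep learning?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])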

    Deploy Mixtral-8x7B SageMaker real-time inference endpoint

    Now that the model has been compiled and stored, you can deploy it for inference using SageMaker. To orchestrate the deployment, you will run Python code from a notebook hosted on an EC2 instance. You can use the instance created in the first section or create a new instance. Note that this EC2 instance can be of any type (for example t2.micro with an Amazon Linux 2023 image). Alternatively, you can use a notebook hosted in Amazon SageMaker Studio.

    Set up AWS authorization for SageMaker deployment

    You need AWS Identity and Access Management (IAM) permissions to manage SageMaker resources. If you created the instance with the provided CloudFormation template, these permissions are already created for you. If not, the following section takes you through the process of setting up the permissions for an EC2 instance to run a notebook that deploys a real-time SageMaker inference endpoint.

    Create an AWS IAM role and attach SageMaker permission policy

    1. Go to the IAM console.
    2. Choose the Roles tab in the navigation pane.
    3. Choose Create role.
    4. Under Select trusted entity, select AWS service.
    5. Choose Use case and select EC2.
    6. Select EC2 (Allows EC2 instances to call AWS services on your behalf.)
    7. Choose Next: Permissions.
    8. In the Add permissions policies screen, select AmazonSageMakerFullAccess and IAMReadOnlyAccess. Note that the AmazonSageMakerFullAccess permission is overly permissive. We use it in this example to simplify the process but recommend applying the principle of least privilege when setting up IAM permissions.
    9. Choose Next: Review.
    10. In the Role name field, enter a role name.
    11. Choose Create role to complete the creation.
    12. With the role created, choose the Roles tab in the navigation pane and select the role you just created.
    13. Choose the Trust relationships tab and then choose Edit trust policy.
    14. Choose Add next to Add a principal.
    15. For Principal type, select AWS services.
    16. Enter sagemaker.amazonaws.com and choose Add a principal.
    17. Choose Update policy. Your trust relationship should look like the following:
    {
        "Version": "2012-10-17",
        "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "ec2.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
            }
        ]
    }
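
    If you prefer to script the role creation rather than use the console, a rough boto3 equivalent looks like the following (the role name is a placeholder, and AmazonSageMakerFullAccess remains overly permissive, as noted above):

    import json
    import boto3

    iam = boto3.client("iam")
    role_name = "sagemaker-mixtral-deploy-role"  # placeholder name

    # Trust policy matching the one shown above (EC2 and SageMaker can assume the role)
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": ["ec2.amazonaws.com", "sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))
    for policy_arn in (
        "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
        "arn:aws:iam::aws:policy/IAMReadOnlyAccess",
    ):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

    Note that attaching a role to an EC2 instance also requires an instance profile; the console handles this for you, whereas a fully scripted setup would additionally call create_instance_profile and add_role_to_instance_profile.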

    Attach the IAM role to your EC2 instance

    1. Go to the Amazon EC2 console.
    2. Choose Instances in the navigation pane.
    3. Select your EC2 instance.
    4. Choose Actions, Security, and then Modify IAM role.
    5. Select the role you created in the previous step.
    6. Choose Update IAM role.

    Launch a Jupyter notebook

    Your next goal is to run a Jupyter notebook hosted in a container running on the EC2 instance. The notebook will be run using a browser on port 8888 by default. For this example, you will use SSH port forwarding from your local machine to the instance to access the notebook.

    1. Continuing from the previous section, you are still within the container. The following steps install Jupyter Notebook:
    pip install ipykernel
    python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python Neuronx"
    pip install jupyter notebook
    pip install environment_kernels
    2. Launch the notebook server using:
    jupyter notebook
    3. Then connect to the notebook in your browser over the SSH tunnel:

    http://localhost:8888/tree?token=…

    If you get a blank screen, try opening this address using your browser’s incognito mode.

    Deploy the model for inference with SageMaker

    After connecting to Jupyter Notebook, follow this notebook. Alternatively, choose File, New, Notebook, and then select Python 3 as the kernel. Use the following instructions and run the notebook cells.

    1. In the notebook, install the sagemaker and huggingface_hub libraries.
    !pip install sagemaker huggingface_hub
    2. Next, get a SageMaker session and execution role that will allow you to create and manage SageMaker resources. You’ll use a Deep Learning Container.
    import os
    import sagemaker
    from sagemaker.huggingface import get_huggingface_llm_image_uri
    
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
    
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
    print(f"sagemaker role arn: {role}")
    
    # retrieve the llm image uri
    llm_image = get_huggingface_llm_image_uri(
    	"huggingface-neuronx",
    	version="0.0.25"
    )
    
    # print ecr image uri
    print(f"llm image uri: {llm_image}")
    
    3. Deploy the compiled model to a SageMaker real-time endpoint on AWS Inferentia2.

    In the following code, update HF_MODEL_ID with your Hugging Face username (in place of user_id) and HUGGING_FACE_HUB_TOKEN with your access token.

    from sagemaker.huggingface import HuggingFaceModel
    
    # sagemaker config
    instance_type = "ml.inf2.24xlarge"
    health_check_timeout=2400 # additional time to load the model
    volume_size=512 # size in GB of the EBS volume
    
    # Define Model and Endpoint configuration parameter
    config = {
    	"HF_MODEL_ID": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace with your model id if you are using your own model
    	"HF_NUM_CORES": "4", # number of neuron cores
    	"HF_AUTO_CAST_TYPE": "fp16",  # dtype of the model
    	"MAX_BATCH_SIZE": "1", # max batch size for the model
    	"MAX_INPUT_LENGTH": "1000", # max length of input text
    	"MAX_TOTAL_TOKENS": "1024", # max length of generated text
    	"MESSAGES_API_ENABLED": "true", # Enable the messages API
    	"HUGGING_FACE_HUB_TOKEN": "hf_..." # Add your Hugging Face token here
    }
    
    # create HuggingFaceModel with the image uri
    llm_model = HuggingFaceModel(
    	role=role,
    	image_uri=llm_image,
    	env=config
    )
    
    4. You’re now ready to deploy the model to a SageMaker real-time inference endpoint. SageMaker will provision the necessary compute resources and retrieve and launch the inference container, which downloads the model artifacts from your Hugging Face repository, loads the model onto the Inferentia devices, and starts serving inference. This process can take several minutes.
    # Deploy model to an endpoint
    # https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
    
    llm_model._is_compiled_model = True # We precompiled the model
    
    llm = llm_model.deploy(
    	initial_instance_count=1,
    	instance_type=instance_type,
    	container_startup_health_check_timeout=health_check_timeout,
    	volume_size=volume_size
    )
    5. Next, run a test to check the endpoint. Update user_id in the parameters to match your Hugging Face username, then create the prompt and generation parameters.
    # Prompt to generate
    messages=[
    	{ "role": "system", "content": "You are a helpful assistant." },
    	{ "role": "user", "content": "What is deep learning?" }
    ]
    
    # Generation arguments
    parameters = {
    	"model": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace user_id
    	"top_p": 0.6,
    	"temperature": 0.9,
    	"max_tokens": 1000,
    }
    6. Send the prompt to the SageMaker real-time endpoint for inference:
    chat = llm.predict({"messages" :messages, **parameters})
    
    print(chat["choices"][0]["message"]["content"].strip())
    7. If you later want to connect to this inference endpoint from other applications, first find the name of the inference endpoint. Alternatively, you can use the SageMaker console and choose Inference and then Endpoints to see a list of the SageMaker endpoints deployed in your account.
    endpoints = sess.sagemaker_client.list_endpoints()
    
    for endpoint in endpoints['Endpoints']:
    	print(endpoint['EndpointName'])
    8. Use the endpoint name to update the following code, which can also be run from other locations:
    from sagemaker.huggingface import HuggingFacePredictor
    
    endpoint_name="endpoint_name..."
    
    llm = HuggingFacePredictor(
    	endpoint_name=endpoint_name,
    	sagemaker_session=sess
    )
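
    The reconnected predictor can then be used in the same way as the one returned by deploy(), for example:

    chat = llm.predict({"messages": messages, **parameters})
    print(chat["choices"][0]["message"]["content"].strip())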

    Cleanup

    Delete the endpoint to prevent future charges for the provisioned resources.

    llm.delete_model()
    llm.delete_endpoint()
    

    Conclusion

    In this post, we covered how to compile and deploy the Mixtral 8x7B language model on AWS Inferentia2 using the Hugging Face Optimum Neuron container and Amazon SageMaker. AWS Inferentia2 offers a cost-effective solution for hosting models like Mixtral, providing high-performance inference at a lower cost.

    For more information, see Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum.

    For other methods to compile and run Mixtral inference on Inferentia2 and Trainium see the Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 tutorial located in the AWS Neuron Documentation and Notebook.


    About the authors

    Lior Sadan is a Senior Solutions Architect at AWS, with an affinity for storage solutions and AI/ML implementations. He helps customers architect scalable cloud systems and optimize their infrastructure. Outside of work, Lior enjoys hands-on home renovation and construction projects.

    Stenio de Lima Ferreira is a Senior Solutions Architect passionate about AI and automation. With over 15 years of work experience in the field, he has a background in cloud infrastructure, devops and data science. He specializes in codifying complex requirements into reusable patterns and breaking down difficult topics into accessible content.
