
    Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

    November 26, 2024

    The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful, publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.

    In this post, we walk you through how to quickly deploy Meta’s latest Llama models using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we use the 1B version, but other sizes, along with other popular LLMs, can be deployed using the same steps.

    Deploy vLLM on AWS Trainium and Inferentia EC2 instances

    In the following sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, create a Docker container that uses vLLM to deploy it, and run online and offline inference against it. We will also cover performance tuning of the inference graph.

    Prerequisite: Hugging Face account and model access

    To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Go to the model card, sign up, and agree to the model license. You will then need a Hugging Face access token, which you can create in your account settings. When you reach the Save your Access Token screen, make sure you copy the token, because it will not be shown again.
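    If you want to confirm access before launching the instance, the following minimal sketch (not part of the original walkthrough) uses the huggingface_hub Python package; the token value is a placeholder, and the call fails if the model license has not been accepted for your account.

    # Hedged sketch: check that a Hugging Face token can see the gated Llama 3.2 repository.
    # Requires `pip install huggingface_hub`; the token string is a placeholder.
    from huggingface_hub import HfApi

    api = HfApi(token="YOUR_TOKEN_HERE")  # the token you copied from the Access Token screen

    # Raises an error if the license has not been accepted or the token is invalid.
    info = api.model_info("meta-llama/Llama-3.2-1B")
    print(f"Access confirmed for: {info.id}")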

    Create an EC2 instance

    You can create an EC2 instance by following the guide. A few things to note (a sketch of the equivalent launch call follows this list):

    1. If this is your first time using inf/trn instances, you will need to request a quota increase.
    2. You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
    3. Increase the gp3 root volume to 100 GiB.
    4. You will use the Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI.
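    As referenced above, here is a hedged boto3 sketch of an equivalent programmatic launch. The AMI ID, key pair, and security group are placeholders you would look up for your account, and the Region is only an example; it is not the exact procedure from this post.

    # Hedged sketch: launch an inf2.xlarge with the Neuron DLAMI and a 100 GiB gp3 root volume.
    # All identifiers below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")  # example Region with Inf2 capacity

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",             # placeholder: Deep Learning AMI Neuron (Ubuntu 22.04)
        InstanceType="inf2.xlarge",
        KeyName="my-key-pair",                       # placeholder key pair for SSH access
        SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group allowing SSH
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[{
            "DeviceName": "/dev/sda1",
            "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"},  # 100 GiB gp3 root volume
        }],
    )
    print(response["Instances"][0]["InstanceId"])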

    After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to run a vLLM container image for Neuron.

    Start vLLM server

    You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:

    cat > Dockerfile <<'EOF'
    # default base image
    ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
    FROM $BASE_IMAGE
    RUN echo "Base image is $BASE_IMAGE"
    # Install some basic utilities
    RUN apt-get update && \
        apt-get install -y \
            git \
            python3 \
            python3-pip \
            ffmpeg libsm6 libxext6 libgl1
    ### Mount Point ###
    # When launching the container, mount the code directory to /app
    ARG APP_MOUNT=/app
    VOLUME [ ${APP_MOUNT} ]
    WORKDIR ${APP_MOUNT}/vllm
    RUN python3 -m pip install --upgrade pip
    RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
    RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
    RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
    RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
    ENV VLLM_TARGET_DEVICE neuron
    RUN git clone https://github.com/vllm-project/vllm.git && \
        cd vllm && \
        git checkout v0.6.2 && \
        python3 -m pip install -U \
            "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
            -r requirements-neuron.txt && \
        pip install --no-build-isolation -v -e . && \
        pip install --upgrade triton==3.0.0
    CMD ["/bin/bash"]
    EOF

    Then run:

    docker build . -t vllm-neuron

    Building the image will take about 10 minutes. After it’s done, run a container from the new image (replace YOUR_TOKEN_HERE with your Hugging Face token):

    export HF_TOKEN="YOUR_TOKEN_HERE"
    docker run \
            -it \
            -p 8000:8000 \
            --device /dev/neuron0 \
            -e HF_TOKEN=$HF_TOKEN \
            -e NEURON_CC_FLAGS=-O1 \
            vllm-neuron

    You can now start the vLLM server with the following command:

    vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

    This command runs vLLM with the following parameters:

    • serve meta-llama/Llama-3.2-1B: The Hugging Face model ID of the model being deployed for inference.
    • --device neuron: Configures vLLM to run on the Neuron device.
    • --tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has one Neuron device, and each Neuron device has two NeuronCores.
    • --max-model-len 4096: The maximum sequence length (input tokens plus output tokens) for which to compile the model.
    • --block-size 8: For Neuron devices, this is internally set to the max-model-len.
    • --max-num-seqs 32: The hardware batch size, or the desired level of concurrency that the model server needs to handle.

    The first time you load a model, it needs to be compiled if no previously compiled version exists. The compiled model can optionally be saved so that the compilation step is skipped if the container is recreated. After everything is done and the model server is running, you should see the following logs:

    Avg prompt throughput: 0.0 tokens/s ...

    This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing Ctrl+P and then Ctrl+Q.

    Inference

    When you started the Docker container, you ran it with the option -p 8000:8000, which told Docker to forward port 8000 from the container to port 8000 on your local machine. When you run the following command, you should see that the model server is running with meta-llama/Llama-3.2-1B:

    curl localhost:8000/v1/models

    This should return something like:

    {"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B","object":"model","created":1732552038,"owned_by":"vllm","root":"meta-llama/Llama-3.2-1B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-6d44a6f6e52447eb9074b13ae1e9e285","object":"model_permission","created":1732552038,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}ubuntu@ip-172-31-12-216:~$ 

    Now, send it a prompt:

    curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

    You should get back a response similar to the following from vLLM:

    ubuntu@ip-172-31-13-178:~$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
      % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                     Dload  Upload   Total   Spent  Left  Speed
    100  1067  100   966  100   101    108     11  0:00:09  0:00:08 0:00:01   258
    " How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system
    that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is
    a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new
    situations and environments."
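    Because vLLM exposes an OpenAI-compatible API, you could also send the same prompt from Python instead of curl. The following is a minimal sketch under that assumption, using the openai client package (not used elsewhere in this post); the API key is a dummy value because this server does not enforce authentication.

    # Hedged sketch: call the vLLM server's OpenAI-compatible completions endpoint from Python.
    # Requires `pip install openai`; run it anywhere that can reach port 8000 on the instance.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # dummy key

    completion = client.completions.create(
        model="meta-llama/Llama-3.2-1B",
        prompt="What is Gen AI?",
        temperature=0,
        max_tokens=128,
    )
    print(completion.choices[0].text)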

    Offline inference with vLLM

    Another way to use vLLM on Inferentia is to send several requests at the same time from a script. This is useful for automation, or when you have a batch of prompts that you want to process together.

    You can reattach to your Docker container and stop the online inference server with the following:

    docker attach $(docker ps --format "{{.ID}}")

    At this point, you should see a blank cursor. Press Ctrl+C to stop the server, and you should be back at the bash prompt inside the container. Create a file for using the offline inference engine:

    cat > offline_inference.py <<EOF
    from vllm.entrypoints.llm import LLM
    from vllm.sampling_params import SamplingParams
    
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    
    # Create an LLM.
    llm = LLM(model="meta-llama/Llama-3.2-1B",
            max_num_seqs=32,
            max_model_len=4096,
            block_size=8,
            device="neuron",
            tensor_parallel_size=2)
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
    EOF

    Now, run the script with python3 offline_inference.py, and you should get back responses for the four prompts. This may take a minute, because the model needs to be started again.

    Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  2.53it/s, est. speed input: 16.46 toks/s, output: 40.51 toks/s]
    Prompt: 'Hello, my name is', Generated text: ' Anna and I am the 4th year student of the Bachelor of Engineering at'
    Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. A'
    Prompt: 'The capital of France is', Generated text: ' also the most expensive city to live in. The average cost of living in Paris'
    Prompt: 'The future of AI is', Generated text: ' now\nThe 10 most influential AI professionals to watch in 2019\n'

    You can now type exit and press Return, and then press Ctrl+C, to shut down the Docker container and return to your Inf2 instance.

    Clean up

    Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.
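    If you prefer to script the cleanup, the following is a hedged boto3 sketch; the instance ID is a placeholder for the one returned when you launched the instance.

    # Hedged sketch: terminate the Inf2 instance so it stops accruing charges.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")  # same Region used for the launch
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID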

    Performance tuning for variable sequence lengths

    You will probably have to process variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine-tune performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets, corresponding to the two phases of LLM inference, through the following environment variables as a list of integers:

    • NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
    • NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.

    You can use the docker run command to set the environment variables when starting the container (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):

    export HF_TOKEN="YOUR_TOKEN_HERE"
    docker run \
            -it \
            -p 8000:8000 \
            --device /dev/neuron0 \
            -e HF_TOKEN=$HF_TOKEN \
            -e NEURON_CC_FLAGS=-O1 \
            -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,1280,1536,1792,2048" \
            -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
            vllm-neuron

    You can then start the server using the same command:

    vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

    Because the model graph has changed, the model needs to be recompiled. If the container was terminated, the model will also be downloaded again. You can then detach from the container by pressing Ctrl+P and then Ctrl+Q and send a request using the same command:

    curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

    For more information about how to configure the buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
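    The bucketing variables are ordinary environment variables, so, assuming the Neuron integration reads them from the process environment, they could also be set at the top of a script such as offline_inference.py before the engine is created. A minimal sketch under that assumption:

    # Hedged sketch: set the bucketing environment variables before building the offline engine.
    # Values mirror the docker run example above; adjust them to your prompt and generation lengths.
    import os

    os.environ["NEURON_CONTEXT_LENGTH_BUCKETS"] = "1024,1280,1536,1792,2048"
    os.environ["NEURON_TOKEN_GEN_BUCKETS"] = "256,512,1024"

    from vllm.entrypoints.llm import LLM  # imported after the variables are set

    llm = LLM(model="meta-llama/Llama-3.2-1B",
              max_num_seqs=32,
              max_model_len=4096,
              block_size=8,
              device="neuron",
              tensor_parallel_size=2)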

    Conclusion

    You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the model ID in the vllm serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.

    After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.


    About the authors

    Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.

    Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
