
    Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

    November 26, 2024

    The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful, publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.

    In this post, we walk you through how to quickly deploy Meta’s latest Llama models using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we use the 1B version, but other sizes, along with other popular LLMs, can be deployed using the same steps.

    Deploy vLLM on AWS Trainium and Inferentia EC2 instances

    In the following sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, create a Docker container that uses vLLM to deploy it, and run online and offline inference against it. We will also cover performance tuning of the inference graph.

    Prerequisite: Hugging Face account and model access

    To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Go to the model card, sign up, and agree to the model license. You will then need a Hugging Face access token, which you can create in your account settings. When you reach the Save your Access Token screen, make sure you copy the token, because it will not be shown again.
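    If you want to confirm access before launching the instance, the following minimal sketch (not part of the original walkthrough) uses the huggingface_hub Python package; the token value is a placeholder, and the call fails if the model license has not been accepted for your account.

    # Hedged sketch: check that a Hugging Face token can see the gated Llama 3.2 repository.
    # Requires `pip install huggingface_hub`; the token string is a placeholder.
    from huggingface_hub import HfApi

    api = HfApi(token="YOUR_TOKEN_HERE")  # the token you copied from the Access Token screen

    # Raises an error if the license has not been accepted or the token is invalid.
    info = api.model_info("meta-llama/Llama-3.2-1B")
    print(f"Access confirmed for: {info.id}")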

    Create an EC2 instance

    You can create an EC2 instance by following the guide. A few things to note (a sketch of the equivalent launch call follows this list):

    1. If this is your first time using inf/trn instances, you will need to request a quota increase.
    2. You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
    3. Increase the gp3 root volume to 100 GiB.
    4. You will use the Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI.
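    As referenced above, here is a hedged boto3 sketch of an equivalent programmatic launch. The AMI ID, key pair, and security group are placeholders you would look up for your account, and the Region is only an example; it is not the exact procedure from this post.

    # Hedged sketch: launch an inf2.xlarge with the Neuron DLAMI and a 100 GiB gp3 root volume.
    # All identifiers below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")  # example Region with Inf2 capacity

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",             # placeholder: Deep Learning AMI Neuron (Ubuntu 22.04)
        InstanceType="inf2.xlarge",
        KeyName="my-key-pair",                       # placeholder key pair for SSH access
        SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group allowing SSH
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[{
            "DeviceName": "/dev/sda1",
            "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"},  # 100 GiB gp3 root volume
        }],
    )
    print(response["Instances"][0]["InstanceId"])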

    After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to run a vLLM container image for Neuron.

    Start vLLM server

    You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:

    cat > Dockerfile <<'EOF'
    # default base image
    ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
    FROM $BASE_IMAGE
    RUN echo "Base image is $BASE_IMAGE"
    # Install some basic utilities
    RUN apt-get update && \
        apt-get install -y \
            git \
            python3 \
            python3-pip \
            ffmpeg libsm6 libxext6 libgl1
    ### Mount Point ###
    # When launching the container, mount the code directory to /app
    ARG APP_MOUNT=/app
    VOLUME [ ${APP_MOUNT} ]
    WORKDIR ${APP_MOUNT}/vllm
    RUN python3 -m pip install --upgrade pip
    RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
    RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
    RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
    RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
    ENV VLLM_TARGET_DEVICE neuron
    RUN git clone https://github.com/vllm-project/vllm.git && \
        cd vllm && \
        git checkout v0.6.2 && \
        python3 -m pip install -U \
            "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
            -r requirements-neuron.txt && \
        pip install --no-build-isolation -v -e . && \
        pip install --upgrade triton==3.0.0
    CMD ["/bin/bash"]
    EOF

    Then run:

    docker build . -t vllm-neuron

    Building the image will take about 10 minutes. After it’s done, run a container from the new image (replace YOUR_TOKEN_HERE with your Hugging Face token):

    export HF_TOKEN="YOUR_TOKEN_HERE"
    docker run \
            -it \
            -p 8000:8000 \
            --device /dev/neuron0 \
            -e HF_TOKEN=$HF_TOKEN \
            -e NEURON_CC_FLAGS=-O1 \
            vllm-neuron

    You can now start the vLLM server with the following command:

    vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

    This command runs vLLM with the following parameters:

    • serve meta-llama/Llama-3.2-1B: The Hugging Face model ID of the model being deployed for inference.
    • --device neuron: Configures vLLM to run on the Neuron device.
    • --tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has one Neuron device, and each Neuron device has two NeuronCores.
    • --max-model-len 4096: The maximum sequence length (input tokens plus output tokens) for which to compile the model.
    • --block-size 8: For Neuron devices, this is internally set to the max-model-len.
    • --max-num-seqs 32: The hardware batch size, or the desired level of concurrency that the model server needs to handle.

    The first time you load a model, it needs to be compiled if no previously compiled version exists. The compiled model can optionally be saved so that the compilation step is skipped if the container is recreated. After everything is done and the model server is running, you should see the following logs:

    Avg prompt throughput: 0.0 tokens/s ...

    This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing Ctrl+P and then Ctrl+Q.

    Inference

    When you started the Docker container, you ran it with the option -p 8000:8000, which told Docker to forward port 8000 from the container to port 8000 on your local machine. When you run the following command, you should see that the model server is running with meta-llama/Llama-3.2-1B:

    curl localhost:8000/v1/models

    This should return something like:

    {"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B","object":"model","created":1732552038,"owned_by":"vllm","root":"meta-llama/Llama-3.2-1B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-6d44a6f6e52447eb9074b13ae1e9e285","object":"model_permission","created":1732552038,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}ubuntu@ip-172-31-12-216:~$ 

    Now, send it a prompt:

    curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

    You should get back a response similar to the following from vLLM:

    ubuntu@ip-172-31-13-178:~$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
      % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                     Dload  Upload   Total   Spent  Left  Speed
    100  1067  100   966  100   101    108     11  0:00:09  0:00:08 0:00:01   258
    " How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system
    that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is
    a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new
    situations and environments."
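    Because vLLM exposes an OpenAI-compatible API, you could also send the same prompt from Python instead of curl. The following is a minimal sketch under that assumption, using the openai client package (not used elsewhere in this post); the API key is a dummy value because this server does not enforce authentication.

    # Hedged sketch: call the vLLM server's OpenAI-compatible completions endpoint from Python.
    # Requires `pip install openai`; run it anywhere that can reach port 8000 on the instance.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # dummy key

    completion = client.completions.create(
        model="meta-llama/Llama-3.2-1B",
        prompt="What is Gen AI?",
        temperature=0,
        max_tokens=128,
    )
    print(completion.choices[0].text)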

    Offline inference with vLLM

    Another way to use vLLM on Inferentia is to send several requests at the same time from a script. This is useful for automation, or when you have a batch of prompts that you want to process together.

    You can reattach to your Docker container and stop the online inference server with the following:

    docker attach $(docker ps --format "{{.ID}}")

    At this point, you should see a blank cursor. Press Ctrl+C to stop the server, and you should be back at the bash prompt inside the container. Create a file for using the offline inference engine:

    cat > offline_inference.py <<EOF
    from vllm.entrypoints.llm import LLM
    from vllm.sampling_params import SamplingParams
    
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    
    # Create an LLM.
    llm = LLM(model="meta-llama/Llama-3.2-1B",
            max_num_seqs=32,
            max_model_len=4096,
            block_size=8,
            device="neuron",
            tensor_parallel_size=2)
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
    EOF

    Now, run the script with python3 offline_inference.py, and you should get back responses for the four prompts. This may take a minute, because the model needs to be started again.

    Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  2.53it/s, est. speed input: 16.46 toks/s, output: 40.51 toks/s]
    Prompt: 'Hello, my name is', Generated text: ' Anna and I am the 4th year student of the Bachelor of Engineering at'
    Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. A'
    Prompt: 'The capital of France is', Generated text: ' also the most expensive city to live in. The average cost of living in Paris'
    Prompt: 'The future of AI is', Generated text: ' now\nThe 10 most influential AI professionals to watch in 2019\n'

    You can now type exit and press Return, and then press Ctrl+C, to shut down the Docker container and return to your Inf2 instance.

    Clean up

    Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.
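    If you prefer to script the cleanup, the following is a hedged boto3 sketch; the instance ID is a placeholder for the one returned when you launched the instance.

    # Hedged sketch: terminate the Inf2 instance so it stops accruing charges.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")  # same Region used for the launch
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID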

    Performance tuning for variable sequence lengths

    You will probably have to process variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine-tune performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets, corresponding to the two phases of LLM inference, through the following environment variables as a list of integers:

    • NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
    • NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.

    You can use the docker run command to set the environment variables when starting the container (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):

    export HF_TOKEN="YOUR_TOKEN_HERE"
    docker run \
            -it \
            -p 8000:8000 \
            --device /dev/neuron0 \
            -e HF_TOKEN=$HF_TOKEN \
            -e NEURON_CC_FLAGS=-O1 \
            -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,1280,1536,1792,2048" \
            -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
            vllm-neuron

    You can then start the server using the same command:

    vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

    Because the model graph has changed, the model needs to be recompiled. If the container was terminated, the model will also be downloaded again. You can then detach from the container by pressing Ctrl+P and then Ctrl+Q and send a request using the same command:

    curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

    For more information about how to configure the buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
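    The bucketing variables are ordinary environment variables, so, assuming the Neuron integration reads them from the process environment, they could also be set at the top of a script such as offline_inference.py before the engine is created. A minimal sketch under that assumption:

    # Hedged sketch: set the bucketing environment variables before building the offline engine.
    # Values mirror the docker run example above; adjust them to your prompt and generation lengths.
    import os

    os.environ["NEURON_CONTEXT_LENGTH_BUCKETS"] = "1024,1280,1536,1792,2048"
    os.environ["NEURON_TOKEN_GEN_BUCKETS"] = "256,512,1024"

    from vllm.entrypoints.llm import LLM  # imported after the variables are set

    llm = LLM(model="meta-llama/Llama-3.2-1B",
              max_num_seqs=32,
              max_model_len=4096,
              block_size=8,
              device="neuron",
              tensor_parallel_size=2)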

    Conclusion

    You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the model ID in the vllm serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.

    After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.


    About the authors

    Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.

    Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
