
    Boost cold-start recommendations with vLLM on AWS Trainium

    July 24, 2025

    Cold start in recommendation systems goes beyond just new user or new item problems—it’s the complete absence of personalized signals at launch. When someone first arrives, or when fresh content appears, there’s no behavioral history to tell the engine what they care about, so everyone ends up in broad, generic segments. That not only dampens click-through and conversion rates but can also drive users away before a system ever gets a chance to learn their tastes. Standard remedies—collaborative filtering, matrix factorization, or popularity lists—lack the nuance to bridge that signal gap, and their one-size-fits-all suggestions quickly feel stale. Imagine, instead, if you could generate detailed interest profiles from day one. By tapping into large language models (LLMs) for zero-shot reasoning, you can synthesize rich, context-aware user and item embeddings without waiting for weeks of interaction data—turning a cold start into a warm welcome.

    In this post, we demonstrate how to use vLLM for scalable inference and use AWS Deep Learning Containers (DLC) to streamline model packaging and deployment. We’ll generate interest expansions through structured prompts, encode them into embeddings, retrieve candidates with FAISS, apply validation to keep results grounded, and frame the cold-start challenge as a scientific experiment—benchmarking LLM and encoder pairings, iterating rapidly on recommendation metrics, and showing clear ROI for each configuration.

    Solution overview

    We build our cold-start solution on AWS Trainium chips (Amazon EC2 Trn1 instances). To streamline model deployment, we use DLCs with the AWS Neuron SDK, which installs Neuron-optimized PyTorch modules and comes with the latest AWS Trainium drivers and runtime pre-installed.

    Figure: Cold-start recommendation pipeline on AWS Trainium with vLLM and NxD. A Jupyter-driven workflow loads data, expands book interest prompts via vLLM, encodes them with sharded encoders using NxD on Amazon EC2 Trn1, and builds FAISS indexes for multiple LLM and encoder variations.

    Sharding large models across multiple Trainium chips is handled by the distributed library used by Neuron, NeuronX Distributed (NxD), which integrates seamlessly with vLLM. NxD manages model partitions across multiple instances with minimal code changes, enabling parallel inference of even 70B parameter LLMs. This combination—Trainium chips, Neuron Tools, and vLLM—gives machine learning (ML) engineers a flexible, cost-efficient, production-ready solution for experimenting with different LLM and encoder configurations and delivers rapid iteration on recommendation quality metrics without modifying core model code.
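    To make this concrete, the following sketch shows roughly how a vLLM engine might be created with a model sharded across Neuron cores through tensor parallelism. The model ID, parallelism degree, and sampling settings here are illustrative assumptions, not the exact configuration used in this post.

    from vllm import LLM, SamplingParams

    # Illustrative only: shard the model across 16 Neuron cores.
    # On a Trainium instance with the Neuron SDK installed, vLLM picks up
    # the Neuron backend automatically; the values below are placeholders.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
        tensor_parallel_size=16,                   # number of cores to shard across
        max_model_len=2048,
    )

    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(
        ["Suggest 3-5 related book topics for a reader who liked a space opera."],
        sampling_params,
    )
    print(outputs[0].outputs[0].text)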

    In the next section, we orchestrate our experiments in a Jupyter notebook—providing a reproducible, end-to-end workflow from loading data and engineering structured prompts to generating embeddings and retrieving candidates with FAISS—complete with interactive charts to visualize recommendation performance. Then, in the production deep-dive, we walk through a reference implementation that packages your Neuron-optimized LLM and encoder as DLC images and deploys them on Amazon Elastic Kubernetes Service (Amazon EKS) with autoscaling, so your inference layer automatically adapts to demand while optimizing cost and performance.

    Expanding user interest profiles with LLMs

    In this post, we use the Amazon Book Reviews dataset (mohamedbakhet/amazon-books-reviews) from Kaggle, which provides real-world user reviews and metadata for tens of thousands of books. This rich collection lets us simulate cold-start scenarios—where a brand-new user has only a single review or like—and evaluate how well our interest expansions, powered by distilled versions of Meta’s Llama 8B and 70B models, generate rich user profiles. We use an LLM to enrich a new user’s profile from minimal initial data. For example, if a user has only reviewed one science fiction novel, the LLM infers related subtopics—such as galactic empires, cyberpunk dystopias, or space exploration—that the user is likely to enjoy. We use structured prompts that embed the user’s existing activity into a concise instruction, which keeps the output consistent and relevant, as demonstrated in the following example:

    # `llm` is the vLLM generation engine and `user_review_category` is the
    # category of the single review the new user has left.
    prompt = (
        f"The user has shown interest in: {user_review_category}.\n"
        "Suggest 3–5 related book topics they might enjoy.\n"
        "Respond with a JSON list of topic keywords."
    )
    expanded_topics = llm.generate([prompt])[0].text

    By constraining the LLM’s output format—asking it to return a JSON array of topic keywords—we avoid free‑form tangents and obtain a predictable list of interest expansions. Modern generative models, such as Meta’s Llama, possess broad domain knowledge and human‑like reasoning, enabling them to connect related concepts and serve as powerful cold‑start boosters by inferring deep user preferences from a single review. These synthetic interests become new signals for our recommendation pipeline, allowing us to retrieve and rank books from the Amazon Reviews collection even with minimal user history. You can experiment with Llama variants ranging from one‑billion to seventy‑billion parameters to identify which model yields the most discriminative and relevant expansions. Those findings will guide our choice of model for production and determine the size and scale of the Amazon EC2 Trainium and Inferentia instances we provision, setting us up for live user A/B tests to validate performance in real‑world settings.
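    Because the prompt constrains the reply to a JSON list, the expansion can be parsed and lightly validated before it enters the pipeline. The following is a minimal sketch of that step, assuming expanded_topics holds the raw string from the previous snippet; the fallback parsing and the cap on topic count are illustrative choices, not part of the original pipeline.

    import json

    def parse_interest_expansion(raw_text, max_topics=5):
        """Parse the LLM's JSON reply and keep only clean topic strings."""
        try:
            topics = json.loads(raw_text)
        except json.JSONDecodeError:
            # The model occasionally wraps the JSON in prose; grab the bracketed part.
            start, end = raw_text.find("["), raw_text.rfind("]") + 1
            topics = json.loads(raw_text[start:end]) if start != -1 and end > 0 else []
        # Keep only non-empty strings and cap the number of expansions.
        return [t.strip() for t in topics if isinstance(t, str) and t.strip()][:max_topics]

    expanded_interests = parse_interest_expansion(expanded_topics)
    print(expanded_interests)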

    Encoding user interests and retrieving relevant content

    After we have our expanded interests, the next step is to turn both those interests and our catalog of books into vectors that we can compare. We explore three sizes of the Google T5 encoder—base, large and XL—to see how embedding dimensionality affects matching quality. The following are the steps:

    1. Load the encoder for each size
    2. Encode book summaries into a single NumPy matrix and normalize it
    3. Build a FAISS index on those normalized vectors for fast nearest‑neighbor search
    4. Encode the expanded interest text the same way and query FAISS to retrieve the top k most similar books
    import numpy as np
    import faiss
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    # Our dataset of book summaries (df holds the Amazon Book Reviews data loaded earlier)
    content_texts = df["review/summary"].tolist()
    # Note: the XL checkpoint is published on the Hugging Face Hub as "google/t5-v1_1-xl"
    encoder_sizes = ["t5-base", "t5-large", "t5-xl"]
    top_k = 5

    for size in encoder_sizes:
        # 1. Load the tokenizer and encoder model for this size
        tokenizer = T5Tokenizer.from_pretrained(size)
        model = T5EncoderModel.from_pretrained(size)

        # 2. Encode all content into mean-pooled embeddings and normalize
        inputs = tokenizer(content_texts, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        content_embs = outputs.last_hidden_state.mean(dim=1).cpu().numpy().astype("float32")
        faiss.normalize_L2(content_embs)

        # 3. Build a FAISS index using inner product (equivalent to cosine on unit vectors)
        index = faiss.IndexFlatIP(content_embs.shape[1])
        index.add(content_embs)

        # 4. Encode a single expanded interest and query the index
        interest = "space opera with political intrigue"
        enc = tokenizer([interest], return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            interest_emb = model(**enc).last_hidden_state.mean(dim=1).cpu().numpy().astype("float32")
        faiss.normalize_L2(interest_emb)

        distances, indices = index.search(interest_emb, top_k)
        recommendations = [content_texts[i] for i in indices[0]]

        print(f"\nTop {top_k} recommendations using {size}:")
        for title in recommendations:
            print(" -", title)

    You can compare how each encoder scale affects both the average FAISS distance (that is, how far apart your interest is from the content) and the actual recommended titles. Swapping in a different encoder family—such as SentenceTransformers—is as straightforward as replacing the model and tokenizer imports.
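    For example, a SentenceTransformers encoder could be dropped in roughly as follows. The model name is an illustrative assumption, and the library handles pooling and normalization instead of the manual mean pooling used above.

    from sentence_transformers import SentenceTransformer
    import faiss

    st_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    content_embs = st_model.encode(content_texts, normalize_embeddings=True).astype("float32")

    index = faiss.IndexFlatIP(content_embs.shape[1])
    index.add(content_embs)

    interest_emb = st_model.encode(
        ["space opera with political intrigue"], normalize_embeddings=True
    ).astype("float32")
    distances, indices = index.search(interest_emb, top_k)
    print([content_texts[i] for i in indices[0]])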

    Measuring and improving recommendation quality

    Now that we’ve generated FAISS indexes for every LLM‑encoder pairing and computed the mean distance between each expanded interest query and its top 10 neighbors, we know exactly how tightly or loosely each model’s embeddings cluster. The following chart shows those average distances for each combination—revealing that 1B and 3B models collapse to almost zero, while 8B and 70B models (especially with larger encoders) produce progressively higher distances, signifying richer, more discriminative signals for recommendation.

    Figure: Average FAISS distance by model and encoder. The 8B models give more spread from expanded interest to the top-10 books than 1B and 3B, while 70B adds little.

    The chart shows that the 1B and 3B models yield an average FAISS distance of zero, meaning their expanded‑interest embeddings are essentially identical and offer no differentiation. By contrast, the 8B model produces a distance of about 0.5 with t5‑base, rising further with t5‑large and t5‑xl, which demonstrates that larger encoders capture more of the model’s nuance. The 70B model only adds a small boost—and only with the XL encoder—so its extra cost yields limited benefit.

    In practical terms, a Llama 8B LLM paired with a base or large T5 encoder delivers clear separation in embedding space without the higher inference time and resource usage of a 70B model.

    Comparing model and encoder impact on embedding spread

    To see how LLM size and encoder scale shape our embedding space, you can measure—for each LLM and encoder pair—the mean FAISS distance from a representative expanded interest vector to its top 10 neighbors. The following bar chart plots those averages side by side. You can instantly spot that 1B and 3B collapse to zero, 8B jumps to around 0.5 and rises with larger encoders, and 70B only adds a small extra spread at the XL scale. This helps you choose the smallest combination that still gives you the embedding diversity needed for effective cold‑start recommendations.

    Figure: FAISS distance by LLM and encoder size. Distances are near zero for 1B and 3B; for 8B and 70B they increase with encoder size.
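    The following is a minimal sketch of how those per-pair averages could be computed. The index file naming convention and the use of the first stored vector as the representative query are assumptions for illustration; the indexes in this post may be organized differently.

    import faiss

    llm_sizes = ["1b", "3b", "8b", "70b"]
    encoders = ["t5-base", "t5-large", "t5-xl"]
    avg_distance = {}

    for llm_size in llm_sizes:
        for enc in encoders:
            # Assumed naming convention for the per-configuration FAISS indexes.
            index = faiss.read_index(f"faiss_{llm_size}_{enc}.index")
            # Use the first stored embedding as a representative expanded-interest query.
            query = index.reconstruct(0).reshape(1, -1)
            scores, _ = index.search(query, 10)
            # For an L2 index these are distances; for an inner-product index they are similarities.
            avg_distance[(llm_size, enc)] = float(scores[0].mean())

    for (llm_size, enc), d in sorted(avg_distance.items()):
        print(f"{llm_size:>4} + {enc:<9}: mean top-10 score = {d:.3f}")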

    Evaluating recommendation overlap across Llama variations and encoders to balance consistency and novelty

    In the next analysis, you build a basic recommend_books helper that, for various LLM sizes and encoder choices, loads the corresponding expanded‑interest DataFrame, reads its FAISS index, reconstructs the first embedding as a stand‑in query, and returns the top-k book titles. Using this helper, we first measure how much each pair of encoders agrees on recommendations for a single LLM—comparing base with large, base with XL, and large with XL—and then, separately, how each pair of LLM sizes aligns for a fixed encoder. Finally, we focus on the 8B model (shown in the following figure) and plot a heatmap of its encoder overlaps, which shows that base and large share about 40% of their top 5 picks while XL diverges more—illustrating how changing the encoder shifts the balance between consistency and novelty in the recommendations.

    Figure: 8B model encoder-overlap heatmap. Percentage overlap in top-5 books across t5-base, t5-large, and t5-xl; base pairings overlap about 40%, while large and XL overlap only 20%.

    For the 8B model, the heatmap shows that t5_base and t5_large share 40% of their top 5 recommendations, t5_base and t5_xl also overlap 40%, while t5_large and t5_xl overlap only 20%, indicating that the XL encoder introduces the most novel titles of the three pairings.
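    The following is a minimal sketch of that overlap computation, assuming a recommend_books(llm_size, encoder, top_k) helper with the behavior described above; the exact function signature is an assumption for illustration.

    from itertools import combinations

    encoders = ["t5_base", "t5_large", "t5_xl"]
    top_k = 5

    def overlap_pct(titles_a, titles_b):
        """Percentage of shared titles between two top-k lists."""
        return 100 * len(set(titles_a) & set(titles_b)) / top_k

    # Encoder agreement for a fixed LLM size (the 8B model here).
    recs = {enc: recommend_books("8b", enc, top_k) for enc in encoders}
    for enc_a, enc_b in combinations(encoders, 2):
        print(f"{enc_a} vs {enc_b}: {overlap_pct(recs[enc_a], recs[enc_b]):.0f}% overlap")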

    Tweaking tensor_parallel_size for optimal cost performance

    To balance inference speed against resource cost, we measured how increasing Neuron tensor parallelism affects latency when expanding user interests with the Llama 3.1 8B model on a trn1.32xlarge instance. We ran the same zero‑shot expansion workload at tensor_parallel_size values of 2, 8, 16, and 32. As shown in the following chart, P50 latency falls by 74%, from 2,480 ms at TP=2 to 650 ms at TP=16, then inches lower to 532 ms at TP=32 (an additional 18% drop). The cost-to-performance chart further below shows that beyond TP=16, doubling parallelism roughly doubles cost for only a 17% further latency gain.

    Figure: Latency compared to tensor parallel size. Latency drops steeply from TP=2 to TP=16 and flattens at TP=32.

    In practice, setting tensor_parallel_size to 16 delivers the best trade‑off: you capture most of the speed‑up from model sharding while avoiding the sharply diminishing returns and higher core‑hour costs that come with maximal parallelism, as shown in the following figure.

    Figure: Cost-performance compared to tensor parallel size. TP=16 offers the best efficiency; TP=32 costs more for little gain.

    The preceding figure visualizes the cost-to-performance ratio of the Llama 8B tests, emphasizing that TP=16 offers the most balanced efficiency before the benefits plateau.
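    The following is a minimal sketch of how such a sweep could be timed, assuming an expand_interests(tp_size, prompts) helper that stands in for relaunching the vLLM engine at each tensor_parallel_size; the helper and the simple cost score are illustrative assumptions, not the exact benchmarking harness used here.

    import time
    import numpy as np

    def benchmark_tp(expand_fn, prompts, tp_sizes=(2, 8, 16, 32), runs=20):
        """Measure P50 latency per TP setting and a simple cost-to-performance score."""
        results = {}
        for tp in tp_sizes:
            latencies = []
            for _ in range(runs):
                start = time.perf_counter()
                expand_fn(tp, prompts)  # hypothetical helper that runs the expansion at this TP
                latencies.append((time.perf_counter() - start) * 1000)
            p50 = float(np.percentile(latencies, 50))
            # Simple cost proxy: cores reserved multiplied by latency (lower is better).
            results[tp] = {"p50_ms": p50, "cost_score": tp * p50}
        return results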

    What’s next?

    Now that we have determined the models and encoders to use, as well as the optimal configuration for our dataset, such as sequence length and batch size, the next step is to deploy the models and define a production workflow that generates expanded interests, encodes them, and makes them ready to match against new content.

    Conclusion

    This post showed how AWS Trainium, the Neuron SDK, and scalable LLM inference can tackle cold-start challenges by enriching sparse user profiles for better recommendations from day one.

    Importantly, our experiments highlight that larger models and encoders don’t always mean better outcomes. While they can produce richer signals, the gains often don’t justify the added cost. You might find that an 8B LLM with a T5-large encoder strikes the best balance between performance and efficiency.

    Rather than assuming bigger is better, this approach helps teams identify the optimal model-encoder pair—delivering high-quality recommendations with cost-effective infrastructure.


    About the authors

    Yahav Biran is a Principal Architect at AWS, focusing on large-scale AI workloads. He contributes to open-source projects and publishes in AWS blogs and academic journals, including the AWS compute and AI blogs and the Journal of Systems Engineering. He frequently delivers technical presentations and collaborates with customers to design Cloud applications. Yahav holds a Ph.D. in Systems Engineering from Colorado State University.

    Nir Ozeri is a Sr. Solutions Architect Manager with Amazon Web Services, based out of New York City. Nir leads a team of Solutions Architects focused on ISV customers. Nir specializes in application modernization, application and product delivery, and scalable application architecture.
