
    Gradient makes LLM benchmarking cost-effective and effortless with AWS Inferentia

    April 2, 2024

    This is a guest post co-written with Michael Feil at Gradient.

Evaluating the performance of large language models (LLMs) is an important step of the pre-training and fine-tuning process before deployment. The faster and more frequently you can validate performance, the better your chances of improving the model.

At Gradient, we work on custom LLM development, and we recently launched our AI Development Lab, offering enterprise organizations a personalized, end-to-end development service to build private, custom LLMs and artificial intelligence (AI) co-pilots. As part of this process, we regularly evaluate the performance of our models (tuned, trained, and open) against open and proprietary benchmarks. While working with the AWS team to train our models on AWS Trainium, we realized we were constrained by both VRAM and the availability of GPU instances when using the mainstream tool for LLM evaluation, lm-evaluation-harness. This open source framework lets you score different generative language models across various evaluation tasks and benchmarks. It is used by leaderboards such as Hugging Face for public benchmarking.

To overcome these challenges, we decided to build and open source our solution: integrating AWS Neuron, the library behind AWS Inferentia and Trainium, into lm-evaluation-harness. This integration made it possible to benchmark v-alpha-tross, an early version of our Albatross model, against other public models during and after the training process.

For context, this integration runs as a new model class within lm-evaluation-harness, abstracting token inference and log-likelihood estimation of sequences without affecting the actual evaluation task. The decision to move our internal testing pipeline to Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances (powered by AWS Inferentia2) gave us access to up to 384 GB of shared accelerator memory, effortlessly fitting all of our current public architectures. By using AWS Spot Instances, we were able to take advantage of unused EC2 capacity in the AWS Cloud, with cost savings of up to 90% off On-Demand prices. This minimized the time it took for testing and allowed us to test more frequently, because we could run across multiple readily available instances and release them when we were finished.
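
To make this concrete, the sketch below shows the rough shape of such a model class using the lm-evaluation-harness model API (the LM base class and register_model decorator from recent 0.4.x releases). The class name and registry key here are placeholders for illustration; the actual Neuron integration in the harness is more complete.

# Illustrative skeleton only; the upstream Neuron integration handles batching,
# tokenization, and compilation details that are omitted here.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model

@register_model("neuronx-sketch")  # hypothetical registry name for this sketch
class NeuronCausalLMSketch(LM):
    def __init__(self, pretrained: str, dtype: str = "bfloat16", batch_size: int = 1, **kwargs):
        super().__init__()
        # The real model class wraps a Neuron-compiled causal LM here.
        self.pretrained = pretrained
        self.dtype = dtype
        self.batch_size = batch_size

    def loglikelihood(self, requests):
        # Return one (log probability, is_greedy) pair per request; used by
        # multiple-choice style benchmarks.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Rolling log likelihood over full documents (perplexity-style tasks).
        raise NotImplementedError

    def generate_until(self, requests):
        # Free-form generation until stop sequences; used by tasks such as gsm8k.
        raise NotImplementedError

Because the evaluation tasks only talk to these methods, swapping the inference backend does not change the benchmark logic itself.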

    In this post, we give a detailed breakdown of our tests, the challenges that we encountered, and an example of using the testing harness on AWS Inferentia.

    Benchmarking on AWS Inferentia2

    The goal of this project was to generate identical scores as shown in the Open LLM Leaderboard (for many CausalLM models available on Hugging Face), while retaining the flexibility to run it against private benchmarks. To see more examples of available models, see AWS Inferentia and Trainium on Hugging Face.

The code changes required to port a model from Hugging Face transformers to the Hugging Face Optimum Neuron Python library were minimal. Because lm-evaluation-harness uses AutoModelForCausalLM, there is a drop-in replacement, NeuronModelForCausalLM. Without a precompiled model, the model is automatically compiled on the fly, which can add 15–60 minutes to a job. This gave us the flexibility to deploy testing for any AWS Inferentia2 instance and supported CausalLM model.
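
As a rough sketch of that swap (the exact export arguments depend on your optimum-neuron version, so treat the keyword arguments below as assumptions rather than the precise configuration we used):

from transformers import AutoTokenizer
# Drop-in replacement for transformers.AutoModelForCausalLM on Inferentia/Trainium
from optimum.neuron import NeuronModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True compiles the model on the fly when no precompiled artifact exists,
# which is the 15-60 minute step mentioned above.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=2048,    # assumption: pick a length that fits your tasks
    auto_cast_type="bf16",   # assumption: dtype naming may differ by version
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))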

    Results

Because of the way the benchmarks and models work, we didn’t expect the scores to match exactly across different runs. However, they should be very close, within the standard deviation, and that is what we have consistently seen, as shown in the following table. The initial benchmarks we ran on AWS Inferentia2 were all confirmed by the Hugging Face leaderboard.
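
As a quick sanity check, the gap between the two gsm8k scores (taking the lower end of each range from the table below) is well inside one standard error, which is what we mean by "very close":

# Values taken from the results table below.
original, inferentia, stderr = 0.3813, 0.3806, 0.0134
assert abs(original - inferentia) < stderr  # 0.0007, far less than 0.0134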

In lm-evaluation-harness, there are two main request types used by different tests: generate_until and loglikelihood. The gsm8k test primarily uses generate_until to generate responses, just as during inference. loglikelihood is mainly used in benchmarking and testing, and examines the probability of different outputs being produced. Both work in Neuron, but the loglikelihood method in SDK 2.16 uses additional steps to determine the probabilities and can take extra time.
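
For intuition, the loglikelihood score for a candidate continuation is essentially the sum of the log-probabilities the model assigns to each continuation token given everything before it. A minimal, backend-agnostic sketch in PyTorch (not the Neuron implementation itself):

import torch
import torch.nn.functional as F

def continuation_loglikelihood(logits: torch.Tensor, continuation_ids: torch.Tensor) -> float:
    """logits: [seq_len, vocab] for context + continuation; continuation_ids: [k].

    Sums the log-probabilities of the k continuation tokens, each conditioned
    on all preceding tokens.
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)
    k = continuation_ids.shape[0]
    # Logits at position t predict the token at position t + 1, so the last k
    # continuation tokens are scored by the predictions at positions [-k-1, -2].
    relevant = log_probs[-k - 1:-1]
    token_scores = relevant.gather(1, continuation_ids.unsqueeze(1)).squeeze(1)
    return token_scores.sum().item()

generate_until, by contrast, simply decodes new tokens until a stop sequence is reached, which is why gsm8k behaves like ordinary inference.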

Lm-evaluation-harness Results

| Hardware Configuration | Original System | AWS Inferentia inf2.48xlarge |
|---|---|---|
| Time with batch_size=1 to evaluate mistralai/Mistral-7B-Instruct-v0.1 on gsm8k | 103 minutes | 32 minutes |
| Score on gsm8k (get-answer, exact_match with std) | 0.3813 – 0.3874 (± 0.0134) | 0.3806 – 0.3844 (± 0.0134) |

    Get started with Neuron and lm-evaluation-harness

    The code in this section can help you use lm-evaluation-harness and run it against supported models on Hugging Face. To see some available models, visit AWS Inferentia and Trainium on Hugging Face.

    If you’re familiar with running models on AWS Inferentia2, you might notice that there is no num_cores setting passed in. Our code detects how many cores are available and automatically passes that number in as a parameter. This lets you run the test using the same code regardless of what instance size you are using. You might also notice that we are referencing the original model, not a Neuron compiled version. The harness automatically compiles the model for you as needed.
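
The idea behind that detection is simple. Here is a minimal sketch of one way to do it; this is an illustration rather than the harness's actual code, and it assumes each Inferentia2 device listed under /dev/neuron* exposes two NeuronCores:

import glob

def detect_neuron_cores(cores_per_device: int = 2) -> int:
    # Each /dev/neuron* entry corresponds to one Neuron device; Inferentia2
    # devices expose 2 NeuronCores each (assumption stated above).
    devices = glob.glob("/dev/neuron*")
    return max(1, len(devices) * cores_per_device)

print(detect_neuron_cores())  # e.g. 24 on inf2.48xlarge, 2 on inf2.xlarge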

    The following steps show you how to deploy the Gradient gradientai/v-alpha-tross model we tested. If you want to test with a smaller example on a smaller instance, you can use the mistralai/Mistral-7B-v0.1 model.

1. The default quota for running On-Demand Inf instances is 0, so request an increase via Service Quotas. Add another request for all Inf Spot Instance requests so you can test with Spot Instances. You will need a quota of 192 vCPUs for this example using an inf2.48xlarge instance, or a quota of 4 vCPUs for a basic inf2.xlarge (if you are deploying the Mistral model). Quotas are AWS Region specific, so make sure you request in us-east-1 or us-west-2.
2. Decide on your instance based on your model. Because v-alpha-tross is a 70B architecture, we decided to use an inf2.48xlarge instance. Deploy an inf2.xlarge for the 7B Mistral model. If you are testing a different model, you may need to adjust your instance depending on its size.
3. Deploy the instance using the Hugging Face DLAMI version 20240123 so that all the necessary drivers are installed. (The price shown includes the instance cost; there is no additional software charge.)
4. Adjust the drive size to 600 GB (100 GB for Mistral 7B).
5. Clone and install lm-evaluation-harness on the instance. We pin a specific commit so that we know any variance is due to model changes, not test or code changes.

    git clone https://github.com/EleutherAI/lm-evaluation-harness
    cd lm-evaluation-harness
# optional: check out the specific revision from the main branch to reproduce the exact results
git checkout 756eeb6f0aee59fc624c81dcb0e334c1263d80e3
# install the repository without overwriting the existing torch and torch-neuronx installation
pip install --no-deps -e .
    pip install peft evaluate jsonlines numexpr pybind11 pytablewriter rouge-score sacrebleu sqlitedict tqdm-multiprocess zstandard hf_transfer

Run lm_eval with the Neuron model type (neuronx) and pass the model's Hugging Face path:

# e.g. use mistralai/Mistral-7B-v0.1 if you are on inf2.xlarge
MODEL_ID=gradientai/v-alpha-tross

python -m lm_eval --model "neuronx" --model_args "pretrained=$MODEL_ID,dtype=bfloat16" --batch_size 1 --tasks gsm8k

    If you run the preceding example with Mistral, you should receive the following output (on the smaller inf2.xlarge, it could take 250 minutes to run):

    ███████████████████████| 1319/1319 [32:52<00:00, 1.50s/it]
    neuronx (pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
    |Tasks|Version| Filter |n-shot| Metric |Value | |Stderr|
|-----|------:|----------|-----:|-----------|-----:|---|-----:|
    |gsm8k| 2|get-answer| 5|exact_match|0.3806|± |0.0134|

    Clean up

    When you are done, be sure to stop the EC2 instances via the Amazon EC2 console.
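
If you prefer to do this programmatically rather than through the console, a minimal sketch with boto3 (the Region and instance ID are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
# Replace with the ID of the Inf2 instance you launched for the benchmark.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])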

    Conclusion

    The Gradient and Neuron teams are excited to see a broader adoption of LLM evaluation with this release. Try it out yourself and run the most popular evaluation framework on AWS Inferentia2 instances. You can now benefit from the on-demand availability of AWS Inferentia2 when you’re using custom LLM development from Gradient. Get started hosting models on AWS Inferentia with these tutorials.

    About the Authors

Michael Feil is an AI engineer at Gradient and previously worked as an ML engineer at Rohde & Schwarz and a researcher at the Max Planck Institute for Intelligent Systems and Bosch Rexroth. Michael is a leading contributor to various open source inference libraries for LLMs and open source projects such as StarCoder. Michael holds a bachelor’s degree in mechatronics and IT from KIT and a master’s degree in robotics from the Technical University of Munich.

    Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups like Gradient. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, a Neuron Ambassador, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.
