
    Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 2

    July 9, 2024

    As generative artificial intelligence (AI) inference becomes increasingly critical for businesses, customers are seeking ways to scale their generative AI operations or integrate generative AI models into existing workflows. Model optimization has emerged as a crucial step, allowing organizations to balance cost-effectiveness and responsiveness, improving productivity. However, price-performance requirements vary widely across use cases. For chat applications, minimizing latency is key to offering an interactive experience, whereas real-time applications like recommendations require maximizing throughput. Navigating these trade-offs poses a significant challenge to the rapid adoption of generative AI, because you must carefully select and evaluate different optimization techniques.

    To overcome these challenges, we are excited to introduce the inference optimization toolkit, a fully managed model optimization feature in Amazon SageMaker. This new feature delivers up to ~2x higher throughput while reducing costs by up to ~50% for generative AI models such as Llama 3, Mistral, and Mixtral. For example, with a Llama 3-70B model, you can achieve up to ~2,400 tokens/sec on an ml.p5.48xlarge instance versus ~1,200 tokens/sec previously without any optimization.

    This inference optimization toolkit uses the latest generative AI model optimization techniques such as compilation, quantization, and speculative decoding to help you reduce the time to optimize generative AI models from months to hours, while achieving the best price-performance for your use case. For compilation, the toolkit uses the Neuron Compiler to optimize the model’s computational graph for specific hardware, such as AWS Inferentia, enabling faster runtimes and reduced resource utilization. For quantization, the toolkit utilizes Activation-aware Weight Quantization (AWQ) to efficiently shrink the model size and memory footprint while preserving quality. For speculative decoding, the toolkit employs a faster draft model to predict candidate outputs in parallel, enhancing inference speed for longer text generation tasks. To learn more about each technique, refer to Optimize model inference with Amazon SageMaker. For more details and benchmark results for popular open source models, see Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.
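
    Speculative decoding in particular is easier to reason about with a toy example. The following sketch is purely illustrative and is not the toolkit's implementation: the functions draft_model, target_model_next, and speculative_decode are hypothetical stand-ins that only show the acceptance logic, where a cheap draft model proposes a block of tokens and the expensive target model keeps the longest prefix it agrees with.

    # Illustrative toy sketch of the speculative decoding idea only; not the
    # SageMaker toolkit's implementation. Both "models" are hypothetical stand-ins.
    def draft_model(prefix, block_size):
        # Cheap draft model: quickly guess the next block_size tokens.
        return [f"t{len(prefix) + i}" for i in range(block_size)]

    def target_model_next(prefix):
        # Expensive target model: its next token for the prefix. In practice the
        # target verifies the whole proposed block in a single forward pass,
        # which is where the latency savings come from.
        return f"t{len(prefix)}"

    def speculative_decode(prompt, max_new_tokens=8, block_size=4):
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new_tokens:
            for candidate in draft_model(tokens, block_size):
                if len(tokens) - len(prompt) >= max_new_tokens:
                    break
                expected = target_model_next(tokens)
                tokens.append(expected)        # the target's token is always kept,
                if candidate != expected:      # so output quality matches the target;
                    break                      # stop accepting at the first disagreement
        return tokens

    print(speculative_decode(["<s>"], max_new_tokens=6))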

    In this post, we demonstrate how to get started with the inference optimization toolkit for supported models in Amazon SageMaker JumpStart and the Amazon SageMaker Python SDK. SageMaker JumpStart is a fully managed model hub that allows you to explore, fine-tune, and deploy popular open source models with just a few clicks. You can use pre-optimized models or create your own custom optimizations. Alternatively, you can accomplish this using the SageMaker Python SDK, as shown in the following notebook. For the full list of supported models, refer to Optimize model inference with Amazon SageMaker.

    Using pre-optimized models in SageMaker JumpStart

    The inference optimization toolkit provides pre-optimized models that have been optimized for best-in-class cost-performance at scale, without any impact on accuracy. You can choose the configuration based on the latency and throughput requirements of your use case and deploy in a single click.

    Taking the Meta-Llama-3-8b model in SageMaker JumpStart as an example, you can choose Deploy from the model page. In the deployment configuration, you can expand the model configuration options, select the number of concurrent users, and deploy the optimized model.

    Deploying a pre-optimized model with the SageMaker Python SDK

    You can also deploy a pre-optimized generative AI model using the SageMaker Python SDK in just a few lines of code. In the following code, we set up a ModelBuilder class for the SageMaker JumpStart model. ModelBuilder is a class in the SageMaker Python SDK that provides fine-grained control over various deployment aspects, such as instance types, network isolation, and resource allocation. You can use it to create a deployable model instance, converting framework models (like XGBoost or PyTorch) or Inference Specs into SageMaker-compatible models for deployment. Refer to Create a model in Amazon SageMaker with ModelBuilder for more details.
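
    The snippets that follow assume a small amount of setup, sketched here under the assumption that the SageMaker Python SDK is installed and an execution role is available (for example, in a SageMaker Studio notebook). The session variable is illustrative; role_arn and output_bucket_name are used by the snippets later in this post.

    # Minimal setup sketch for the snippets that follow (assumes the SageMaker
    # Python SDK is installed and an IAM execution role is available).
    import sagemaker
    from sagemaker.serve.builder.model_builder import ModelBuilder
    from sagemaker.serve.builder.schema_builder import SchemaBuilder

    session = sagemaker.Session()
    role_arn = sagemaker.get_execution_role()      # execution role used for deployment
    output_bucket_name = session.default_bucket()  # S3 bucket for optimized artifacts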

    sample_input = {
        "inputs": "Hello, I'm a language model,",
        "parameters": {"max_new_tokens": 128, "do_sample": True}
    }

    sample_output = [
        {
            "generated_text": "Hello, I'm a language model, and I'm here to help you with your English."
        }
    ]

    schema_builder = SchemaBuilder(sample_input, sample_output)

    builder = ModelBuilder(
        model="meta-textgeneration-llama-3-8b",  # JumpStart model ID
        schema_builder=schema_builder,
        role_arn=role_arn,
    )

    List the available pre-benchmarked configurations with the following code:

    builder.display_benchmark_metrics()

    Choose the appropriate instance_type and config_name from the list based on your concurrent users, latency, and throughput requirements. The resulting table shows the latency and throughput across different concurrency levels for each instance type and config name; a config name of lmi-optimized means the configuration has been pre-optimized by SageMaker. Then you can call .build() to run the optimization job. When the job is complete, you can deploy to an endpoint and test the model predictions. See the following code:

    # Set the deployment config with a pre-optimized configuration
    builder.set_deployment_config(
        instance_type="ml.g5.12xlarge",
        config_name="lmi-optimized"
    )

    # Build the deployable model
    model = builder.build()

    # Deploy the model to a SageMaker endpoint
    predictor = model.deploy(accept_eula=True)

    # Use the sample input payload to test the deployed endpoint
    predictor.predict(sample_input)

    Using the inference optimization toolkit to create custom optimizations

    In addition to creating a pre-optimized model, you can create custom optimizations based on the instance type you choose. The following table provides a full list of available combinations. In the following sections, we explore compilation on AWS Inferentia first, and then try the other optimization techniques for GPU instances.

    Instance Types | Optimization Technique | Configurations
    AWS Inferentia | Compilation | Neuron Compiler
    GPUs | Quantization | AWQ
    GPUs | Speculative decoding | SageMaker-provided or Bring Your Own (BYO) draft model

    Compilation from SageMaker JumpStart

    For compilation, you can select the same Meta-Llama-3-8b model from SageMaker JumpStart and choose Optimize on the model page. On the optimization configuration page, you can choose ml.inf2.8xlarge as your instance type, then provide an output Amazon Simple Storage Service (Amazon S3) location for the optimized artifacts. For large models such as Llama 2 70B, the compilation job can take more than an hour, so we recommend using the inference optimization toolkit to perform ahead-of-time compilation. That way, you only need to compile once.

    Compilation using the SageMaker Python SDK

    For the SageMaker Python SDK, you can configure the compilation by changing the environment variables in the .optimize() function. For more details on compilation_config, refer to the LMI NeuronX ahead-of-time compilation of models tutorial.

    compiled_model = builder.optimize(
        instance_type="ml.inf2.8xlarge",
        accept_eula=True,
        compilation_config={
            "OverrideEnvironment": {
                "OPTION_TENSOR_PARALLEL_DEGREE": "2",
                "OPTION_N_POSITIONS": "2048",
                "OPTION_DTYPE": "fp16",
                "OPTION_ROLLING_BATCH": "auto",
                "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
                "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
            }
        },
        output_path=f"s3://{output_bucket_name}/compiled/"
    )

    # Deploy the compiled model to a SageMaker endpoint
    predictor = compiled_model.deploy(accept_eula=True)

    # Use the sample input payload to test the deployed endpoint
    predictor.predict(sample_input)

    Quantization and speculative decoding from SageMaker JumpStart

    For optimizing models on GPU, ml.g5.12xlarge is the default deployment instance type for Llama-3-8b. You can choose quantization, speculative decoding, or both as optimization options. Quantization uses AWQ to reduce the model’s weights to low-bit (INT4) representations. Finally, you can provide an output S3 URL to store the optimized artifacts.

    With speculative decoding, you can improve latency and throughput by either using the SageMaker-provided draft model or bringing your own draft model from the public Hugging Face model hub or your own S3 bucket.

    After the optimization job is complete, you can deploy the model or run further evaluation jobs on it. In the SageMaker Studio UI, you can choose to use the default sample datasets or provide your own using an S3 URI. At the time of writing, the performance evaluation option is only available through the Amazon SageMaker Studio UI.

    Quantization and speculative decoding using the SageMaker Python SDK

    The following is the SageMaker Python SDK code snippet for quantization. You just need to provide the quantization_config attribute in the .optimize() function.

    optimized_model = builder.optimize(
        instance_type="ml.g5.12xlarge",
        accept_eula=True,
        quantization_config={
            "OverrideEnvironment": {
                "OPTION_QUANTIZE": "awq"
            }
        },
        output_path=f"s3://{output_bucket_name}/quantized/"
    )

    # Deploy the optimized model to a SageMaker endpoint
    predictor = optimized_model.deploy(accept_eula=True)

    # Use the sample input payload to test the deployed endpoint
    predictor.predict(sample_input)

    For speculative decoding, you switch to the speculative_decoding_config attribute and configure either the SageMaker-provided draft model or your own. You may need to adjust the GPU utilization based on the sizes of the draft and target models so that both fit on the instance for inference.

    optimized_model = builder.optimize(
        instance_type="ml.g5.12xlarge",
        accept_eula=True,
        speculative_decoding_config={
            "ModelProvider": "sagemaker"
        }
        # To bring your own draft model instead, use:
        # speculative_decoding_config={
        #     "ModelProvider": "custom",
        #     # Use an S3 URI or Hugging Face model ID for the custom draft model.
        #     # Note: using a Hugging Face model ID as the draft model requires HF_TOKEN in the environment variables.
        #     "ModelSource": "s3://custom-bucket/draft-model",
        # }
    )

    # Deploy the optimized model to a SageMaker endpoint
    predictor = optimized_model.deploy(accept_eula=True)

    # Use the sample input payload to test the deployed endpoint
    predictor.predict(sample_input)
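
    When you are done experimenting, remember to clean up the resources you created to avoid ongoing charges. A minimal cleanup sketch, where predictor is the object returned by deploy() in the snippets above:

    # Clean up to avoid ongoing charges once you are done testing.
    predictor.delete_endpoint()  # deletes the endpoint (and its endpoint configuration)
    predictor.delete_model()     # deletes the associated SageMaker model resource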

    Conclusion

    Optimizing generative AI models for inference performance is crucial for delivering cost-effective and responsive generative AI solutions. With the launch of the inference optimization toolkit, you can now optimize your generative AI models, using the latest techniques such as speculative decoding, compilation, and quantization to achieve up to ~2x higher throughput and reduce costs by up to ~50%. This helps you achieve the optimal price-performance balance for your specific use cases with just a few clicks in SageMaker JumpStart or a few lines of code using the SageMaker Python SDK. The inference optimization toolkit significantly simplifies the model optimization process, enabling your business to accelerate generative AI adoption and unlock more opportunities to drive better business outcomes.

    To learn more, refer to Optimize model inference with Amazon SageMaker and Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.

    About the Authors

    James Wu is a Senior AI/ML Specialist Solutions Architect
    Saurabh Trikande is a Senior Product Manager
    Rishabh Ray Chaudhury is a Senior Product Manager
    Kumara Swami Borra is a Front End Engineer
    Alwin (Qiyun) Zhao is a Senior Software Development Engineer
    Qing Lan is a Senior Software Development Engineer
