    Build and deploy AI inference workflows with new enhancements to the Amazon SageMaker Python SDK

    June 30, 2025

    Amazon SageMaker Inference has been a popular tool for deploying advanced machine learning (ML) and generative AI models at scale. As AI applications become increasingly complex, customers want to deploy multiple models in a coordinated group that collectively processes inference requests for an application. In addition, with the evolution of generative AI applications, many use cases now require inference workflows—sequences of interconnected models operating in predefined logical flows. This trend drives a growing need for more sophisticated inference offerings.

    To address this need, we are introducing a new capability in the SageMaker Python SDK that revolutionizes how you build and deploy inference workflows on SageMaker. We use Amazon Search as an example to showcase how this feature helps customers build inference workflows. This new Python SDK capability provides a streamlined and simplified experience that abstracts away the underlying complexities of packaging and deploying groups of models and their collective inference logic, allowing you to focus on what matters most—your business logic and model integrations.

    In this post, we provide an overview of the user experience, detailing how to set up and deploy these workflows with multiple models using the SageMaker Python SDK. We walk through examples of building complex inference workflows, deploying them to SageMaker endpoints, and invoking them for real-time inference. We also show how customers like Amazon Search plan to use SageMaker Inference workflows to provide more relevant search results to Amazon shoppers.

    Whether you are building a simple two-step process or a complex, multimodal AI application, this new feature provides the tools you need to bring your vision to life. This tool aims to make it easy for developers and businesses to create and manage complex AI systems, helping them build more powerful and efficient AI applications.

    In the following sections, we dive deeper into details of the SageMaker Python SDK, walk through practical examples, and showcase how this new capability can transform your AI development and deployment process.

    Key improvements and user experience

    The SageMaker Python SDK now includes new features for creating and managing inference workflows. These additions aim to address common challenges in developing and deploying inference workflows:

    • Deployment of multiple models – The core of this new experience is the deployment of multiple models as inference components within a single SageMaker endpoint. With this approach, you can create a more unified inference workflow. By consolidating multiple models into one endpoint, you can reduce the number of endpoints that need to be managed, simplify operational tasks, improve resource utilization, and potentially lower costs.
    • Workflow definition with workflow mode – The new workflow mode extends the existing Model Builder capabilities. It allows for the definition of inference workflows using Python code. Users familiar with the ModelBuilder class might find this feature to be an extension of their existing knowledge. This mode enables creating multi-step workflows, connecting models, and specifying the data flow between different models in the workflows. The goal is to reduce the complexity of managing these workflows and enable you to focus more on the logic of the resulting compound AI system.
    • Development and deployment options – A new deployment option has been introduced for the development phase. This feature is designed to allow for quicker deployment of workflows to development environments. The intention is to enable faster testing and refinement of workflows. This could be particularly relevant when experimenting with different configurations or adjusting models.
    • Invocation flexibility – The SDK now provides options for invoking individual models or entire workflows. You can choose to call a specific inference component used in a workflow or the entire workflow. This flexibility can be useful in scenarios where access to a specific model is needed, or when only a portion of the workflow needs to be executed.
    • Dependency management – You can use SageMaker Deep Learning Containers (DLCs) or the SageMaker distribution that comes preconfigured with various model serving libraries and tools. These are intended to serve as a starting point for common use cases.

    To get started, use the SageMaker Python SDK to deploy your models as inference components. Then, use the workflow mode to create an inference workflow, represented as Python code using the container of your choice. Deploy the workflow container as another inference component, either on the same endpoint as the models or on a dedicated endpoint. You can run the workflow by invoking the inference component that represents the workflow. The user experience is entirely code-based, using the SageMaker Python SDK. This approach allows you to define, deploy, and manage inference workflows using the SDK abstractions offered by this feature and standard Python programming. The workflow mode provides the flexibility to specify complex sequences of model invocations and data transformations, and the option to deploy as components or endpoints caters to various scaling and integration needs.

    Solution overview

    The following diagram illustrates a reference architecture using the SageMaker Python SDK.

    The improved SageMaker Python SDK introduces a more intuitive and flexible approach to building and deploying AI inference workflows. Let’s explore the key components and classes that make up the experience:

    • ModelBuilder simplifies the process of packaging individual models as inference components. It handles model loading, dependency management, and container configuration automatically.
    • The CustomOrchestrator class provides a standardized way to define custom inference logic that orchestrates multiple models in the workflow. Users implement the handle() method to specify this logic and can use an orchestration library or none at all (plain Python).
    • A single deploy() call handles the deployment of the components and workflow orchestrator.
    • The Python SDK supports invocation against the custom inference workflow or individual inference components.
    • The Python SDK supports both synchronous and streaming inference.

    CustomOrchestrator is an abstract base class that serves as a template for defining custom inference orchestration logic. It standardizes the structure of entry point-based inference scripts, making it straightforward for users to create consistent and reusable code. The handle method in the class is an abstract method that users implement to define their custom orchestration logic.

    from abc import ABC, abstractmethod

    class CustomOrchestrator(ABC):
        """
        Templated class used to standardize the structure of an entry point-based inference script.
        """

        @abstractmethod
        def handle(self, data, context=None):
            """Abstract method that defines the entry point for the model server."""
            return NotImplemented

    With this templated class, users integrate their custom workflow code and then point to it in the model builder using a file path or directly using a class or method name. Together with the ModelBuilder class, this enables a more streamlined workflow for AI inference:

    1. Users define their custom workflow by implementing the CustomOrchestrator class.
    2. The custom CustomOrchestrator is passed to ModelBuilder using the ModelBuilder inference_spec parameter.
    3. ModelBuilder packages the CustomOrchestrator along with the model artifacts.
    4. The packaged model is deployed to a SageMaker endpoint (for example, using a TorchServe container).
    5. When invoked, the SageMaker endpoint uses the custom handle() function defined in the CustomOrchestrator to handle the input payload.

    In the following sections, we provide two examples of custom workflow orchestrators implemented with plain Python code. For simplicity, the examples use two inference components.

    We explore how to create a simple workflow that deploys two large language models (LLMs) on SageMaker Inference endpoints along with a simple Python orchestrator that calls the two models. We create an IT customer service workflow where one model processes the initial request and another suggests solutions. You can find the example notebook in the GitHub repo.
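
    For example, a request to this workflow can be a plain-text IT support question, and the final response is a JSON object containing the suggested resolution. The shapes below are illustrative only; the exact prompts and outputs depend on the models you deploy:

    # Illustrative request and response for the IT customer service workflow
    example_request = "My laptop cannot connect to the office VPN after the latest update."
    example_response = {
        "generated_text": "1. Confirm the VPN client is up to date. 2. Re-enter your credentials. ..."
    }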

    Prerequisites

    To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, we host multiple models on the same SageMaker endpoint, so we use two ml.g5.24xlarge SageMaker hosting instances.
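
    The code snippets in the following sections assume a few variables that are defined earlier in the example notebook, such as the AWS Region, the IAM role, the endpoint and inference component names, and the sample input and output used by the schema builder. A minimal setup sketch with placeholder names (adjust them for your environment) might look like the following:

    import sagemaker
    from sagemaker.session import Session

    # Session, Region, and execution role (assumes an environment with SageMaker
    # permissions, such as a SageMaker notebook instance or SageMaker Studio)
    sagemaker_session = Session()
    region = sagemaker_session.boto_region_name
    role = sagemaker.get_execution_role()

    # Placeholder names for the endpoint, inference components, and custom workflow
    llama_mistral_endpoint_name = "llama-mistral-workflow-endpoint"
    llama_ic_name = "llama-3-1-8b-ic"
    mistral_ic_name = "mistral-7b-ic"
    custom_workflow_name = "it-support-workflow"

    # Sample input and output used by SchemaBuilder to infer the serialization schema
    sample_input = {"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 50}}
    sample_output = [{"generated_text": "The capital of France is Paris."}]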

    Python inference orchestration

    First, let’s define our custom orchestration class that inherits from CustomOrchestrator. The workflow is structured around a custom inference entry point that handles the request data, processes it, and retrieves predictions from the configured model endpoints. See the following code:

    import json
    import boto3

    class PythonCustomInferenceEntryPoint(CustomOrchestrator):
        def __init__(self, region_name, endpoint_name, component_names):
            self.region_name = region_name
            self.endpoint_name = endpoint_name
            self.component_names = component_names

        @property
        def client(self):
            # Lazily create a SageMaker Runtime client for invoking the inference
            # components, so the orchestrator object stays easy to serialize when
            # it is packaged for deployment
            if not hasattr(self, "_client"):
                self._client = boto3.client("sagemaker-runtime", region_name=self.region_name)
            return self._client

        def preprocess(self, data):
            payload = {
                "inputs": data.decode("utf-8")
            }
            return json.dumps(payload)

        def _invoke_workflow(self, data):
            # First model (Llama) inference
            payload = self.preprocess(data)

            llama_response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=payload,
                ContentType="application/json",
                InferenceComponentName=self.component_names[0]
            )
            llama_generated_text = json.loads(llama_response.get('Body').read())['generated_text']

            # Second model (Mistral) inference
            parameters = {
                "max_new_tokens": 50
            }
            payload = {
                "inputs": llama_generated_text,
                "parameters": parameters
            }
            mistral_response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=json.dumps(payload),
                ContentType="application/json",
                InferenceComponentName=self.component_names[1]
            )
            return {"generated_text": json.loads(mistral_response.get('Body').read())['generated_text']}

        def handle(self, data, context=None):
            return self._invoke_workflow(data)

    This code performs the following functions:

    • Defines the orchestration that sequentially calls two models using their inference component names
    • Processes the response from the first model before passing it to the second model
    • Returns the final generated response

    This plain Python approach provides flexibility and control over the request-response flow, enabling seamless cascading of outputs across multiple model components.
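
    After the inference components are deployed (as shown in the next section), you can also sanity-check this orchestration logic directly by instantiating the class and calling handle() yourself. The following is a minimal sketch that assumes the endpoint and component names defined earlier and an already deployed endpoint:

    # Quick check of the orchestration logic (requires the endpoint and
    # inference components from the deployment step below to already exist)
    orchestrator_logic = PythonCustomInferenceEntryPoint(
        region_name=region,
        endpoint_name=llama_mistral_endpoint_name,
        component_names=[llama_ic_name, mistral_ic_name],
    )
    result = orchestrator_logic.handle(b"My laptop cannot connect to the office VPN. What should I check?")
    print(result["generated_text"])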

    Build and deploy the workflow

    To deploy the workflow, we first create our inference components and then build the custom workflow. One inference component will host a Meta Llama 3.1 8B model, and the other will host a Mistral 7B model.

    from sagemaker.serve import ModelBuilder
    from sagemaker.serve.builder.schema_builder import SchemaBuilder
    from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

    # Create a ModelBuilder instance for Llama 3.1 8B.
    # Pre-benchmarked ResourceRequirements will be taken from JumpStart, as Llama-3.1-8b is a supported model.
    llama_model_builder = ModelBuilder(
        model="meta-textgeneration-llama-3-1-8b",
        schema_builder=SchemaBuilder(sample_input, sample_output),
        inference_component_name=llama_ic_name,
        instance_type="ml.g5.24xlarge"
    )

    # Create a ModelBuilder instance for the Mistral 7B model.
    mistral_mb = ModelBuilder(
        model="huggingface-llm-mistral-7b",
        instance_type="ml.g5.24xlarge",
        schema_builder=SchemaBuilder(sample_input, sample_output),
        inference_component_name=mistral_ic_name,
        resource_requirements=ResourceRequirements(
            requests={
               "memory": 49152,
               "num_accelerators": 2,
               "copies": 1
            }
        )
    )

    Now we can tie it all together by creating one more ModelBuilder for the custom workflow, passing it the modelbuilder_list that contains the ModelBuilder objects we just created for each inference component. Then we call the build() function to prepare the workflow for deployment.

    # Create workflow ModelBuilder
    orchestrator = ModelBuilder(
        inference_spec=PythonCustomInferenceEntryPoint(
            region_name=region,
            endpoint_name=llama_mistral_endpoint_name,
            component_names=[llama_ic_name, mistral_ic_name],
        ),
        dependencies={
            "auto": False,
            "custom": [
                "cloudpickle",
                "graphene",
                # Define other dependencies here.
            ],
        },
        sagemaker_session=Session(),
        role_arn=role,
        resource_requirements=ResourceRequirements(
            requests={
               "memory": 4096,
               "num_accelerators": 1,
               "copies": 1,
               "num_cpus": 2
            }
        ),
        name=custom_workflow_name, # Endpoint name for your custom workflow
        schema_builder=SchemaBuilder(sample_input={"inputs": "test"}, sample_output="Test"),
        modelbuilder_list=[llama_model_builder, mistral_mb] # Inference Component ModelBuilders created in Step 2
    )
    # call the build function to prepare the workflow for deployment
    orchestrator.build()

    In the preceding code snippet, you can comment out the section that defines the resource_requirements to have the custom workflow deployed on a separate endpoint instance, which can be a dedicated CPU instance to handle the custom workflow payload.

    By calling the deploy() function, we deploy the custom workflow and the inference components to your desired instance type, in this example ml.g5.24xlarge. If you choose to deploy the custom workflow to a separate instance, by default it will use the ml.c5.xlarge instance type. You can set inference_workflow_instance_type and inference_workflow_initial_instance_count to configure the instances required to host the custom workflow.

    predictors = orchestrator.deploy(
        instance_type="ml.g5.24xlarge",
        initial_instance_count=1,
        accept_eula=True, # Required for Llama3
        endpoint_name=llama_mistral_endpoint_name
        # inference_workflow_instance_type="ml.t2.medium", # default
        # inference_workflow_initial_instance_count=1 # default
    )
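
    If you do want the custom workflow hosted on its own CPU instance (with resource_requirements omitted from the orchestrator ModelBuilder as described above), a sketch of the deploy call might look like the following; the workflow instance type shown here is only an example:

    # Sketch: host the custom workflow on a dedicated CPU instance, separate from
    # the GPU instances that serve the model inference components
    predictors = orchestrator.deploy(
        instance_type="ml.g5.24xlarge",                   # instances for the model inference components
        initial_instance_count=1,
        accept_eula=True,                                 # required for Llama 3
        endpoint_name=llama_mistral_endpoint_name,
        inference_workflow_instance_type="ml.c5.xlarge",  # example CPU instance for the workflow container
        inference_workflow_initial_instance_count=1
    )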

    Invoke the endpoint

    After you deploy the workflow, you can invoke the endpoint using the predictor object:

    from sagemaker.serializers import JSONSerializer
    predictors[-1].serializer = JSONSerializer()
    predictors[-1].predict("Tell me a story about ducks.")
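
    Because the orchestrator returns a JSON object with a generated_text key, you can optionally attach a JSON deserializer so the response comes back as a Python dictionary; a minimal sketch:

    from sagemaker.deserializers import JSONDeserializer

    # Deserialize the workflow response into a Python dict
    predictors[-1].deserializer = JSONDeserializer()
    response = predictors[-1].predict("Tell me a story about ducks.")
    print(response["generated_text"])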

    You can also invoke each inference component in the deployed endpoint. For example, we can test the Llama inference component with a synchronous invocation, and Mistral with streaming:

    from sagemaker.predictor import Predictor

    # Create a predictor for the inference component of the Llama model
    llama_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=llama_ic_name)
    llama_predictor.content_type = "application/json"

    # Example payload (adjust the prompt and parameters for your use case)
    payload = {"inputs": "My laptop cannot connect to the office VPN. What should I check first?",
               "parameters": {"max_new_tokens": 100}}

    llama_predictor.predict(json.dumps(payload))

    When handling the streaming response, we need to read each line of the output separately. The following example code demonstrates this streaming handling by checking for newline characters to separate and print each token in real time:

    mistral_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=mistral_ic_name)
    mistral_predictor.content_type = "application/json"

    # Example prompt and generation parameters (adjust for your use case)
    prompt = "Suggest three troubleshooting steps for a laptop that cannot connect to the office VPN."
    parameters = {"max_new_tokens": 100}

    body = json.dumps({
        "inputs": prompt,
        # specify the parameters as needed
        "parameters": parameters
    })

    for line in mistral_predictor.predict_stream(body):
        decoded_line = line.decode('utf-8')
        if '\n' in decoded_line:
            # Split by newline to handle multiple tokens in the same line
            tokens = decoded_line.split('\n')
            for token in tokens[:-1]:  # Print all tokens except the last one with a newline
                print(token)
            # Print the last token without a newline, as it might be followed by more tokens
            print(tokens[-1], end='')
        else:
            # Print the token without a newline if it doesn't contain '\n'
            print(decoded_line, end='')
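
    If you also want the complete response after the stream finishes, a simpler variation is to accumulate the decoded chunks while printing them as they arrive:

    # Variation: print the stream as it arrives and keep the full response text
    full_text = ""
    for line in mistral_predictor.predict_stream(body):
        chunk = line.decode("utf-8")
        full_text += chunk
        print(chunk, end="", flush=True)
    print()  # final newline after the stream completes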

    So far, we have walked through example code that demonstrates how to build complex inference logic using Python orchestration, deploy it to SageMaker endpoints, and invoke it for real-time inference. The Python SDK automatically handles the following:

    • Model packaging and container configuration
    • Dependency management and environment setup
    • Endpoint creation and component coordination

    Whether you’re building a simple workflow of two models or a complex multimodal application, the new SDK provides the building blocks needed to bring your inference workflows to life with minimal boilerplate code.

    Customer story: Amazon Search

    Amazon Search is a critical component of the Amazon shopping experience, processing an enormous volume of queries across billions of products in diverse categories. At the core of this system are sophisticated matching and ranking workflows, which determine the order and relevance of search results presented to customers. These workflows execute large deep learning models in predefined sequences, often sharing models across different workflows to improve price-performance and accuracy. This approach makes sure that whether a customer is searching for electronics, fashion items, books, or other products, they receive the most pertinent results tailored to their query.

    The SageMaker Python SDK enhancement offers valuable capabilities that align well with Amazon Search’s requirements for these ranking workflows. It provides a standard interface for developing and deploying complex inference workflows crucial for effective search result ranking. The enhanced Python SDK enables efficient reuse of shared models across multiple ranking workflows while maintaining the flexibility to customize logic for specific product categories. Importantly, it allows individual models within these workflows to scale independently, providing optimal resource allocation and performance based on varying demand across different parts of the search system.

    Amazon Search is exploring the broad adoption of these Python SDK enhancements across their search ranking infrastructure. This initiative aims to further refine and improve search capabilities, enabling the team to build, version, and catalog workflows that power search ranking more effectively across different product categories. The ability to share models across workflows and scale them independently offers new levels of efficiency and adaptability in managing the complex search ecosystem.

    Vaclav Petricek, Sr. Manager of Applied Science at Amazon Search, highlighted the potential impact of these SageMaker Python SDK enhancements: “These capabilities represent a significant advancement in our ability to develop and deploy sophisticated inference workflows that power search matching and ranking. The flexibility to build workflows using Python, share models across workflows, and scale them independently is particularly exciting, as it opens up new possibilities for optimizing our search infrastructure and rapidly iterating on our matching and ranking algorithms as well as new AI features. Ultimately, these SageMaker Inference enhancements will allow us to more efficiently create and manage the complex algorithms powering Amazon’s search experience, enabling us to deliver even more relevant results to our customers.”

    The following diagram illustrates a sample solution architecture used by Amazon Search.

    Clean up

    When you’re done testing the models, delete the endpoint as a best practice to avoid unnecessary costs if it’s no longer required. You can follow the cleanup section of the demo notebook or use the following code to delete the models and endpoint created by the demo:

    # workflow_predictor is the predictor for the custom workflow's inference
    # component, for example predictors[-1] returned by the deploy() call above
    workflow_predictor = predictors[-1]

    mistral_predictor.delete_predictor()
    llama_predictor.delete_predictor()
    workflow_predictor.delete_predictor()
    llama_predictor.delete_endpoint()

    Conclusion

    The new SageMaker Python SDK enhancements for inference workflows mark a significant advancement in the development and deployment of complex AI inference workflows. By abstracting the underlying complexities, these enhancements empower inference customers to focus on innovation rather than infrastructure management. This feature bridges sophisticated AI applications with the robust SageMaker infrastructure, enabling developers to use familiar Python-based tools while harnessing the powerful inference capabilities of SageMaker.

    Early adopters, including Amazon Search, are already exploring how these capabilities can drive major improvements in AI-powered customer experiences across diverse industries. We invite all SageMaker users to explore this new functionality, whether you’re developing classic ML models, building generative AI applications or multi-model workflows, or tackling multi-step inference scenarios. The enhanced SDK provides the flexibility, ease of use, and scalability needed to bring your ideas to life. As AI continues to evolve, SageMaker Inference evolves with it, providing you with the tools to stay at the forefront of innovation. Start building your next-generation AI inference workflows today with the enhanced SageMaker Python SDK.


    About the authors

    Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

    Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

    Osho Gupta is a Senior Software Developer at AWS SageMaker. He is passionate about the ML infrastructure space, and is motivated to learn & advance underlying technologies that optimize Gen AI training & inference performance. In his spare time, Osho enjoys paddle boarding, hiking, traveling, and spending time with his friends & family.

    Joseph Zhang is a software engineer at AWS. He started his AWS career at EC2 before eventually transitioning to SageMaker, and now works on developing GenAI-related features. Outside of work he enjoys both playing and watching sports (go Warriors!), spending time with family, and making coffee.

    Gary Wang is a Software Developer at AWS SageMaker. He is passionate about AI/ML operations and building new things. In his spare time, Gary enjoys running, hiking, trying new food, and spending time with his friends and family.

    James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

    Vaclav Petricek is a Senior Applied Science Manager at Amazon Search, where he led teams that built Amazon Rufus and now leads science and engineering teams that work on the next generation of Natural Language Shopping. He is passionate about shipping AI experiences that make people’s lives better. Vaclav loves off-piste skiing, playing tennis, and backpacking with his wife and three children.

    Wei Li is a Senior Software Dev Engineer in Amazon Search. She is passionate about Large Language Model training and inference technologies, and loves integrating these solutions into Search Infrastructure to enhance natural language shopping experiences. During her leisure time, she enjoys gardening, painting, and reading.

    Brian Granger is a Senior Principal Technologist at Amazon Web Services and a professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. He works at the intersection of UX design and engineering on tools for scientific computing, data science, machine learning, and data visualization. Brian is a co-founder and leader of Project Jupyter, co-founder of the Altair project for statistical visualization, and creator of the PyZMQ project for ZMQ-based message passing in Python. At AWS he is a technical and open source leader in the AI/ML organization. Brian also represents AWS as a board member of the PyTorch Foundation. He is a winner of the 2017 ACM Software System Award and the 2023 NASA Exceptional Public Achievement Medal for his work on Project Jupyter. He has a Ph.D. in theoretical physics from the University of Colorado.
