Recently, we’ve been witnessing the rapid development and evolution of generative AI applications, with observability and evaluation emerging as critical aspects for developers, data scientists, and stakeholders. Observability refers to the ability to understand the internal state and behavior of a system by analyzing its outputs, logs, and metrics. Evaluation, on the other hand, involves assessing the quality and relevance of the generated outputs, enabling continual improvement.
Comprehensive observability and evaluation are essential for troubleshooting, identifying bottlenecks, optimizing applications, and providing relevant, high-quality responses. Observability empowers you to proactively monitor and analyze your generative AI applications, and evaluation helps you collect feedback, refine models, and enhance output quality.
In the context of Amazon Bedrock, observability and evaluation become even more crucial. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. As the complexity and scale of these applications grow, providing comprehensive observability and robust evaluation mechanisms are essential for maintaining high performance, quality, and user satisfaction.
We have built a custom observability solution that Amazon Bedrock users can quickly implement using just a few key building blocks and existing logs using FMs, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. This solution uses decorators in your application code to capture and log metadata such as input prompts, output results, run time, and custom metadata, offering enhanced security, ease of use, flexibility, and integration with native AWS services.
Notably, the solution supports comprehensive Retrieval Augmented Generation (RAG) evaluation so you can assess the quality and relevance of generated responses, identify areas for improvement, and refine the knowledge base or model accordingly.
In this post, we set up the custom solution for observability and evaluation of Amazon Bedrock applications. Through code examples and step-by-step guidance, we demonstrate how you can seamlessly integrate this solution into your Amazon Bedrock application, unlocking a new level of visibility, control, and continual improvement for your generative AI applications.
By the end of this post, you will:
- Understand the importance of observability and evaluation in generative AI applications
- Learn about the key features and benefits of this solution
- Gain hands-on experience in implementing the solution through step-by-step demonstrations
- Explore best practices for integrating observability and evaluation into your Amazon Bedrock workflows
Prerequisites
To implement the observability solution discussed in this post, you need the following prerequisites:
- An active Amazon Web Services (AWS) account and AWS Identity and Access Management (IAM) role with Amazon Bedrock access
- Access to the FMs you plan to use
- Basic understanding of decorators in your preferred programming language (Python or Node.js)
- A clone of the amazon-bedrock-samples GitHub repository
- Basic familiarity with AWS services such as Amazon Data Firehose, Amazon Athena, and AWS Glue crawlers (optional, depending on the specific components used in the solution)
Solution overview
The observability solution for Amazon Bedrock empowers users to track and analyze interactions with FMs, knowledge bases, guardrails, and agents using decorators in their source code. Key highlights of the solution include:
- Decorator – Decorators are applied to functions invoking Amazon Bedrock APIs, capturing input prompt, output results, custom metadata, custom metrics, and latency related metrics.
- Flexible logging –You can use this solution to store logs either locally or in Amazon Simple Storage Service (Amazon S3) using Amazon Data Firehose, enabling integration with existing monitoring infrastructure. Additionally, you can choose what gets logged.
- Dynamic data partitioning – The solution enables dynamic partitioning of observability data based on different workflows or components of your application, such as prompt preparation, data preprocessing, feedback collection, and inference. This feature allows you to separate data into logical partitions, making it easier to analyze and process data later.
- Security – The solution uses AWS services and adheres to AWS Cloud Security best practices so your data remains within your AWS account.
- Cost optimization – This solution uses serverless technologies, making it cost-effective for the observability infrastructure. However, some components may incur additional usage-based costs.
- Multiple programming language support – The GitHub repository provides the observability solution in both Python and Node.js versions, catering to different programming preferences.
Here’s a high-level overview of the observability solution architecture:
The following steps explain how the solution works:
- Application code using Amazon Bedrock is decorated with
@bedrock_logs.watch
to save the log - Logged data streams through Amazon Data Firehose
- AWS Lambda transforms the data and applies dynamic partitioning based on
call_type
 variable - Amazon S3 stores the data securely
- Optional components for advanced analytics
- AWS Glue creates tables from S3 data
- Amazon Athena enables data querying
- Visualize logs and insights in your favorite dashboard tool
This architecture provides comprehensive logging, efficient data processing, and powerful analytics capabilities for your Amazon Bedrock applications.
Getting started
To help you get started with the observability solution, we have provided example notebooks in the attached GitHub repository, covering knowledge bases, evaluation, and agents for Amazon Bedrock. These notebooks demonstrate how to integrate the solution into your Amazon Bedrock application and showcase various use cases and features including feedback collected from users or quality assurance (QA) teams.
The repository contains well-documented notebooks that cover topics such as:
- Setting up the observability infrastructure
- Integrating the decorator pattern into your application code
- Logging model inputs, outputs, and custom metadata
- Collecting and analyzing feedback data
- Evaluating model responses and knowledge base performance
- Example visualization for observability data using AWS services
To get started with the example notebooks, follow these steps:
- Clone the GitHub repository
- Navigate to the observability solution directory
- Follow the instructions in the README file to set up the required AWS resources and configure the solution
- Open the provided Jupyter notebooks and follow along with the examples and demonstrations
These notebooks provide a hands-on learning experience and serve as a starting point for integrating our solution into your generative AI applications. Feel free to explore, modify, and adapt the code examples to suit your specific requirements.
Key features
The solution offers a range of powerful features to streamline observability and evaluation for your generative AI applications on Amazon Bedrock:
- Decorator-based implementation – Use decorators to seamlessly integrate observability logging into your application functions, capturing inputs, outputs, and metadata without modifying the core logic
- Selective logging – Choose what to log by selectively capturing function inputs, outputs, or excluding sensitive information or large data structures that might not be relevant for observability
- Logical data partitioning – Create logical partitions in the observability data based on different workflows or application components, enabling easier analysis and processing of specific data subsets
- Human-in-the-loop evaluation – Collect and associate human feedback with specific model responses or sessions, facilitating comprehensive evaluation and continual improvement of your application’s performance and output quality
- Multi-component support – Support observability and evaluation for various Amazon Bedrock components, including
InvokeModel
, batch inference, knowledge bases, agents, and guardrails, providing a unified solution for your generative AI applications - Comprehensive evaluation – Evaluate the quality and relevance of generated responses, including RAG evaluation for knowledge base applications, using the open source RAGAS library to compute evaluation metrics
This concise list highlights the key features you can use to gain insights, optimize performance, and drive continual improvement for your generative AI applications on Amazon Bedrock. For a detailed breakdown of the features and implementation specifics, refer to the comprehensive documentation in the GitHub repository.
Implementation and best practices
The solution is designed to be modular and flexible so you can customize it according to your specific requirements. Although the implementation is straightforward, following best practices is crucial for the scalability, security, and maintainability of your observability infrastructure.
Solution deployment
This solution includes an AWS CloudFormation template that streamlines the deployment of required AWS resources, providing consistent and repeatable deployments across environments. The CloudFormation template provisions resources such as Amazon Data Firehose delivery streams, AWS Lambda functions, Amazon S3 buckets, and AWS Glue crawlers and databases.
Decorator pattern
The solution uses the decorator pattern to integrate observability logging into your application functions seamlessly. The @bedrock_logs.watch
decorator wraps your functions, automatically logging inputs, outputs, and metadata to Amazon Kinesis Firehose. Here’s an example of how to use the decorator:
Human-in-the-loop evaluation
The solution supports human-in-the-loop evaluation so you can incorporate human feedback into the performance evaluation of your generative AI application. You can involve end users, experts, or QA teams in the evaluation process, providing insights to enhance output quality and relevance. Here’s an example of how you can implement human-in-the-loop evaluation:
By using the run_id
and observation_id
generated, you can associate human feedback with specific model responses or sessions. This feedback can then be analyzed and used to refine the knowledge base, fine-tune models, or identify areas for improvement.
Best practices
It’s recommended to follow these best practices:
- Plan call types in advance – Determine the logical partitions (
call_type
) for your observability data based on different workflows or application components. This enables easier analysis and processing of specific data subsets. - Use feedback variables – Configure
feedback_variables=True
when initializingBedrockLogs
to generaterun_id
andobservation_id
. These IDs can be used to join logically partitioned datasets, associating feedback data with corresponding model responses. - Extend for general steps – Although the solution is designed for Amazon Bedrock, you can use the decorator pattern to log observability data for general steps such as prompt preparation, postprocessing, or other custom workflows.
- Log custom metrics – If you need to calculate custom metrics such as latency, context relevance, faithfulness, or any other metric, you can pass these values in the response of your decorated function, and the solution will log them alongside the observability data.
- Selective logging – Use the
capture_input
andcapture_output
parameters to selectively log function inputs or outputs or exclude sensitive information or large data structures that might not be relevant for observability. - Comprehensive evaluation – Evaluate the quality and relevance of generated responses, including RAG evaluation for knowledge base applications, using the
KnowledgeBasesEvaluations
By following these best practices and using the features of the solution, you can set up comprehensive observability and evaluation for your generative AI applications to gain valuable insights, identify areas for improvement, and enhance the overall user experience.
In the next post in this three-part series, we dive deeper into observability and evaluation for RAG and agent-based generative AI applications, providing in-depth insights and guidance.
Clean up
To avoid incurring costs and maintain a clean AWS account, you can remove the associated resources by deleting the AWS CloudFormation stack you created for this walkthrough. You can follow the steps provided in the Deleting a stack on the AWS CloudFormation console documentation to delete the resources created for this solution.
Conclusion and next steps
This comprehensive solution empowers you to seamlessly integrate comprehensive observability into your generative AI applications in Amazon Bedrock. Key benefits include streamlined integration, selective logging, custom metadata tracking, and comprehensive evaluation capabilities, including RAG evaluation. Use AWS services such as Athena to analyze observability data, drive continual improvement, and connect with your favorite dashboard tool to visualize the data.
This post focused is on Amazon Bedrock, but it can be extended to broader machine learning operations (MLOps) workflows or integrated with other AWS services such as AWS Lambda or Amazon SageMaker. We encourage you to explore this solution and integrate it into your workflows. Access the source code and documentation in our GitHub repository and start your integration journey. Embrace the power of observability and unlock new heights for your generative AI applications.
About the authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focused on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Source: Read MoreÂ