
    Use custom metrics to evaluate your generative AI application with Amazon Bedrock

    May 6, 2025

    With Amazon Bedrock Evaluations, you can evaluate foundation models (FMs) and Retrieval Augmented Generation (RAG) systems, whether they are hosted on Amazon Bedrock (including Amazon Bedrock Knowledge Bases) or elsewhere, such as multi-cloud or on-premises deployments. We recently announced the general availability of the large language model (LLM)-as-a-judge technique in model evaluation and the new RAG evaluation tool, which is also powered by LLM-as-a-judge behind the scenes. These tools are already empowering organizations to systematically evaluate FMs and RAG systems with enterprise-grade tooling. These evaluation tools aren’t limited to models or RAG systems hosted on Amazon Bedrock: with the bring your own inference (BYOI) responses feature, you can evaluate models or applications hosted anywhere, as long as you follow the input formatting requirements for the respective offering.

    The LLM-as-a-judge technique powering these evaluations enables automated, human-like evaluation quality at scale, using FMs to assess quality and responsible AI dimensions without manual intervention. With built-in metrics like correctness (factual accuracy), completeness (response thoroughness), faithfulness (hallucination detection), and responsible AI metrics such as harmfulness and answer refusal, you and your team can evaluate models hosted on Amazon Bedrock and knowledge bases natively, or using BYOI responses from your custom-built systems.

    Amazon Bedrock Evaluations offers an extensive list of built-in metrics for both evaluation tools, but there are times when you might want to define these evaluation metrics in a different way, or make completely new metrics that are relevant to your use case. For example, you might want to define a metric that evaluates an application response’s adherence to your specific brand voice, or want to classify responses according to a custom categorical rubric. You might want to use numerical scoring or categorical scoring for various purposes. For these reasons, you need a way to use custom metrics in your evaluations.

    Now with Amazon Bedrock, you can develop custom evaluation metrics for both model and RAG evaluations. This capability extends the LLM-as-a-judge framework that drives Amazon Bedrock Evaluations.

    In this post, we demonstrate how to use custom metrics in Amazon Bedrock Evaluations to measure and improve the performance of your generative AI applications according to your specific business requirements and evaluation criteria.

    Overview

    Custom metrics in Amazon Bedrock Evaluations offer the following features:

    • Simplified getting started experience – Pre-built starter templates, based on our industry-tested built-in metrics, are available on the AWS Management Console, with the option to create metrics from scratch for specific evaluation criteria.
    • Flexible scoring systems – Support is available for both quantitative (numerical) and qualitative (categorical) scoring, so you can create ordinal or nominal metrics, or even use the evaluation tools for classification tasks.
    • Streamlined workflow management – You can save custom metrics for reuse across multiple evaluation jobs or import previously defined metrics from JSON files.
    • Dynamic content integration – With built-in template variables (for example, {{prompt}}, {{prediction}}, and {{context}}), you can seamlessly inject dataset content and model outputs into evaluation prompts.
    • Customizable output control – You can use our recommended output schema for consistent results, with advanced options to define custom output formats for specialized use cases.

    Custom metrics give you unprecedented control over how you measure AI system performance, so you can align evaluations with your specific business requirements and use cases. Whether assessing factuality, coherence, helpfulness, or domain-specific criteria, custom metrics in Amazon Bedrock enable more meaningful and actionable evaluation insights.

    In the following sections, we walk through the steps to create a job with model evaluation and custom metrics using both the Amazon Bedrock console and the Python SDK and APIs.

    Supported data formats

    In this section, we review some important data formats.

    Judge prompt uploading

    To upload your previously saved custom metrics into an evaluation job, follow the JSON format in the following examples.

    The following code illustrates a definition with numerical scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
            "ratingScale": [
                {
                    "definition": "first rating definition",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "second rating definition",
                    "value": {
                        "floatValue": 2
                    }
                },
                {
                    "definition": "third rating definition",
                    "value": {
                        "floatValue": 1
                    }
                }
            ]
        }
    }

    The following code illustrates a definition with string scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
            "ratingScale": [
                {
                    "definition": "first rating definition",
                    "value": {
                        "stringValue": "first value"
                    }
                },
                {
                    "definition": "second rating definition",
                    "value": {
                        "stringValue": "second value"
                    }
                },
                {
                    "definition": "third rating definition",
                    "value": {
                        "stringValue": "third value"
                    }
                }
            ]
        }
    }

    The following code illustrates a definition with no scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}"
        }
    }

    For more information on defining a judge prompt with no scale, see the best practices section later in this post.

    Model evaluation dataset format

    When using LLM-as-a-judge, only one model can be evaluated per evaluation job. Consequently, you must provide a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLM-as-a-judge in model evaluation. Fields marked with ? are optional.

    {
        "prompt": string
        "referenceResponse"?: string
        "category"?: string
         "modelResponses"?: [
            {
                "response": string
                "modelIdentifier": string
            }
        ]
    }
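
    For example, a single record in the input JSONL file for a BYOI job might look like the following (the prompt, response, and model identifier here are placeholders for illustration):

    {"prompt": "What is Amazon S3?", "referenceResponse": "Amazon S3 is an object storage service.", "category": "aws-basics", "modelResponses": [{"response": "Amazon S3 is a scalable object storage service from AWS.", "modelIdentifier": "my-external-model"}]}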

    RAG evaluation dataset format

    We updated the evaluation job input dataset format to be even more flexible for RAG evaluation. Now, you can bring referenceContexts, which are expected retrieved passages, so you can compare your actual retrieved contexts to your expected retrieved contexts. You can find the new referenceContexts field in the updated JSONL schema for RAG evaluation:

    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": string
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": string
                }]
            }],
            "referenceContexts"?: [{
                "content": [{
                    "text": string
                }]
            }],
            "output": {
                "text": string,
                "modelIdentifier"?: string,
                "knowledgeBaseIdentifier": string,
                "retrievedPassages": {
                    "retrievalResults": [{
                        "name"?: string,
                        "content": {
                            "text": string
                        },
                        "metadata"?: {
                            [key: string]: string
                        }
                    }]
                }
            }
        }]
    }
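
    For example, a single BYOI record for a retrieve-and-generate evaluation might look like the following (the question, passages, and identifiers are placeholders for illustration):

    {"conversationTurns": [{"prompt": {"content": [{"text": "What is the refund window?"}]}, "referenceResponses": [{"content": [{"text": "Refunds are available within 30 days of purchase."}]}], "output": {"text": "You can request a refund within 30 days of purchase.", "knowledgeBaseIdentifier": "my-knowledge-base", "retrievedPassages": {"retrievalResults": [{"content": {"text": "Our policy allows refunds within 30 days of purchase."}}]}}}]}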

    Variables for data injection into judge prompts

    To make sure that your data is injected into the judge prompts in the right place, use the variables described below. For each variable, we also note where the evaluation tool pulls the data from your input file, if applicable. If you bring your own inference responses to the evaluation job, the service uses that data from your input file; if you don’t, it calls the Amazon Bedrock model or knowledge base and prepares the responses for you.

    The following variables are available for model evaluation:

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Response: variable {{prediction}}; input dataset JSONL key for a BYOI job: modelResponses.response; if you don’t bring your own inference responses, the evaluation job calls the model and prepares this data for you (mandatory)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key: referenceResponse (optional)

    The following variables are available for RAG evaluation (retrieve only):

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (optional)
    • Retrieved passage: variable {{context}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (mandatory)
    • Ground truth retrieved passage: variable {{reference_contexts}}; input dataset JSONL key: referenceContexts (optional)

    The following variables are available for RAG evaluation (retrieve and generate):

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Response: variable {{prediction}}; input dataset JSONL key for a BYOI job: output.text; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (mandatory)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key: referenceResponses (optional)
    • Retrieved passage: variable {{context}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (optional)
    • Ground truth retrieved passage: variable {{reference_contexts}}; input dataset JSONL key: referenceContexts (optional)

    Prerequisites

    To use the LLM-as-a-judge model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

    • AWS account and model access:
      • An active AWS account
      • Selected evaluator and generator models enabled in Amazon Bedrock (verify on the Model access page of the Amazon Bedrock console)
      • Confirmed AWS Regions where the models are available and their quotas
    • AWS Identity and Access Management (IAM) and Amazon Simple Storage Service (Amazon S3) configuration:
      • Completed IAM setup and permissions for both model and RAG evaluation
      • Configured S3 bucket with appropriate permissions for accessing and writing output data
      • Enabled CORS on your S3 bucket
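
    If CORS isn’t already enabled on the bucket, the following is a minimal sketch using boto3 (the wildcard origin and the bucket name are placeholders for illustration; restrict AllowedOrigins for production use):

    import boto3

    # Apply a permissive CORS rule so the Amazon Bedrock console can read
    # evaluation artifacts from the bucket. Tighten AllowedOrigins in production.
    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket="<YOUR_BUCKET_NAME>",
        CORSConfiguration={
            "CORSRules": [
                {
                    "AllowedHeaders": ["*"],
                    "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                    "AllowedOrigins": ["*"],
                    "ExposeHeaders": ["Access-Control-Allow-Origin"],
                }
            ]
        },
    )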

    Create a model evaluation job with custom metrics using Amazon Bedrock Evaluations

    Complete the following steps to create a job with model evaluation and custom metrics using Amazon Bedrock Evaluations:

    1. On the Amazon Bedrock console, choose Evaluations in the navigation pane, then choose the Models tab.
    2. In the Model evaluation section, on the Create dropdown menu, choose Automatic: model as a judge.
    3. For the Model evaluation details, enter an evaluation name and optional description.
    4. For Evaluator model, choose the model you want to use for automatic evaluation.
    5. For Inference source, select the source and choose the model you want to evaluate.

    For this example, we chose Claude 3.5 Sonnet as the evaluator model, Bedrock models as our inference source, and Claude 3.5 Haiku as our model to evaluate.

    6. The console displays the default metrics for the evaluator model you chose. You can select other metrics as needed.
    7. In the Custom Metrics section, create a new metric called “Comprehensiveness.” Use the provided template and modify it for your metric. You can use the following variables to define the metric, where only {{prediction}} is mandatory:
      1. prompt
      2. prediction
      3. ground_truth

    The following is the metric we defined in full:

    Your role is to judge the comprehensiveness of an answer based on the question and
    the prediction. Assess the quality, accuracy, and helpfulness of the language model
    response, and use these to judge how comprehensive the response is. Award higher
    scores to responses that are detailed and thoughtful.
    
    Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
    against all specified criteria. Assign a single overall score that best represents the
    comprehensiveness, and provide a brief explanation justifying your rating, referencing
    specific strengths and weaknesses observed.
    
    When evaluating the response quality, consider the following rubrics:
    - Accuracy: Factual correctness of information provided
    - Completeness: Coverage of important aspects of the query
    - Clarity: Clear organization and presentation of information
    - Helpfulness: Practical utility of the response to the user
    
    Evaluate the following:
    
    Query:
    {{prompt}}
    
    Response to evaluate:
    {{prediction}}

    8. Create the output schema and additional metrics. Here, we define a scale that awards the maximum points (10) if the response is very comprehensive, and 1 if the response is not comprehensive at all.
    9. For Datasets, enter your input and output locations in Amazon S3.
    10. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose a role.
    11. Choose Create and wait for the job to complete.

    Considerations and best practices

    When using the output schema of the custom metrics, note the following:

    • If you use the built-in output schema (recommended), do not add your grading scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes. This is so the evaluation service can parse the judge model’s results and display them on the console in graphs and calculate average values of numerical scores.
    • The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics. Because judge LLMs are inherently stochastic, there might be some responses we can’t parse and display on the console and use in your average score calculations. However, the raw judge responses are always loaded into your S3 output file, even if the evaluation service cannot parse the response score from the judge model.
    • If you don’t use the built-in output schema feature (although we recommend using it), then you are responsible for providing your rating scale in the body of the judge prompt instructions. However, the evaluation service won’t add structured output instructions and won’t parse the results to show graphs; you will see the full plain-text judge output on the console without graphs, and the raw data will still be in your S3 bucket.

    Create a model evaluation job with custom metrics using the Python SDK and APIs

    To use the Python SDK to create a model evaluation job with custom metrics, follow these steps (or refer to our example notebook):

    1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, IAM role with appropriate permissions, Amazon S3 paths for input data containing your inference responses, and output location for results:
      import boto3
      import time
      from datetime import datetime
      
      # Configure model settings
      evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      generator_model = "amazon.nova-lite-v1:0"
      custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
      BUCKET_NAME = "<YOUR_BUCKET_NAME>"
      
      # Specify S3 locations
      input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
      output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
      
      # Create Bedrock client
      # NOTE: You can change the region name to the region of your choosing.
      bedrock_client = boto3.client('bedrock', region_name='us-east-1') 
    2. To define a custom metric for model evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}} and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate model outputs consistently according to your specific criteria.
      comprehensiveness_metric ={
          "customMetricDefinition": {
              "name": "comprehensiveness",
              "instructions": """Your role is to judge the comprehensiveness of an 
      answer based on the question and the prediction. Assess the quality, accuracy, 
      and helpfulness of language model response, and use these to judge how comprehensive
       the response is. Award higher scores to responses that are detailed and thoughtful.
      
      Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
       against all specified criteria. Assign a single overall score that best represents the 
      comprehensivenss, and provide a brief explanation justifying your rating, referencing 
      specific strengths and weaknesses observed.
      
      When evaluating the response quality, consider the following rubrics:
      - Accuracy: Factual correctness of information provided
      - Completeness: Coverage of important aspects of the query
      - Clarity: Clear organization and presentation of information
      - Helpfulness: Practical utility of the response to the user
      
      Evaluate the following:
      
      Query:
      {{prompt}}
      
      Response to evaluate:
      {{prediction}}""",
              "ratingScale": [
                  {
                      "definition": "Very comprehensive",
                      "value": {
                          "floatValue": 10
                      }
                  },
                  {
                      "definition": "Mildly comprehensive",
                      "value": {
                          "floatValue": 3
                      }
                  },
                  {
                      "definition": "Not at all comprehensive",
                      "value": {
                          "floatValue": 1
                      }
                  }
              ]
          }
      }
    3. To create a model evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (such as Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your generator model, evaluator model, and proper Amazon S3 paths for input dataset and output results.
      # Create the model evaluation job
      model_eval_job_name = f"model-evaluation-custom-metrics{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
      
      model_eval_job = bedrock_client.create_evaluation_job(
          jobName=model_eval_job_name,
          jobDescription="Evaluate model performance with custom comprehensiveness metric",
          roleArn=role_arn,
          applicationType="ModelEvaluation",
          inferenceConfig={
              "models": [{
                  "bedrockModel": {
                      "modelIdentifier": generator_model
                  }
              }]
          },
          outputDataConfig={
              "s3Uri": output_path
          },
          evaluationConfig={
              "automated": {
                  "datasetMetricConfigs": [{
                      "taskType": "General",
                      "dataset": {
                          "name": "ModelEvalDataset",
                          "datasetLocation": {
                              "s3Uri": input_data
                          }
                      },
                      "metricNames": [
                          "Builtin.Correctness",
                          "Builtin.Completeness",
                          "Builtin.Coherence",
                          "Builtin.Relevance",
                          "Builtin.FollowingInstructions",
                          "comprehensiveness"
                      ]
                  }],
                  "customMetricConfig": {
                      "customMetrics": [
                          comprehensiveness_metric
                      ],
                      "evaluatorModelConfig": {
                          "bedrockEvaluatorModels": [{
                              "modelIdentifier": custom_metrics_evaluator_model
                          }]
                      }
                  },
                  "evaluatorModelConfig": {
                      "bedrockEvaluatorModels": [{
                          "modelIdentifier": evaluator_model
                      }]
                  }
              }
          }
      )
      
      print(f"Created model evaluation job: {model_eval_job_name}")
      print(f"Job ID: {model_eval_job['jobArn']}")
    4. After submitting the evaluation job, monitor its status with get_evaluation_job and access results at your specified Amazon S3 location when complete, including the standard and custom metric performance data.
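
    The following is a minimal polling sketch under the same setup as the previous steps (the status value checked here and the sleep interval are assumptions for illustration):

    import time

    # Poll the evaluation job until it reaches a terminal state, then point
    # to the S3 output location configured when the job was created.
    job_arn = model_eval_job["jobArn"]
    while True:
        job = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
        status = job["status"]
        print(f"Job status: {status}")
        if status != "InProgress":
            break
        time.sleep(60)

    print(f"Final status: {status}. Results were written to {output_path}")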

    Create a RAG system evaluation with custom metrics using Amazon Bedrock Evaluations

    In this example, we walk through a RAG system evaluation with a combination of built-in metrics and custom evaluation metrics on the Amazon Bedrock console. Complete the following steps:

    1. On the Amazon Bedrock console, choose Evaluations in the navigation pane.
    2. On the RAG tab, choose Create.
    3. For the RAG evaluation details, enter an evaluation name and optional description.
    4. For Evaluator model, choose the model you want to use for automatic evaluation. The evaluator model selected here is used to calculate the default metrics, if you select any. For this example, we chose Claude 3.5 Sonnet as the evaluator model.
    5. Include any optional tags.
    6. For Inference source, select the source. Here, you have the option to select between Bedrock Knowledge Bases and Bring your own inference responses. If you’re using Amazon Bedrock Knowledge Bases, you will need to choose a previously created knowledge base or create a new one. For BYOI responses, you can bring the prompt dataset, context, and output from a RAG system. For this example, we chose Bedrock Knowledge Base as our inference source.
    7. Specify the evaluation type, response generator model, and built-in metrics. You can choose between a combined retrieval and response evaluation or a retrieval only evaluation, with options to use default metrics, custom metrics, or both for your RAG evaluation. The response generator model is only required when using an Amazon Bedrock knowledge base as the inference source. For the BYOI configuration, you can proceed without a response generator. For this example, we selected Retrieval and response generation as our evaluation type and chose Nova Lite 1.0 as our response generator model.
    8. In the Custom Metrics section, choose your evaluator model. We selected Claude 3.5 Sonnet v1 as our evaluator model for custom metrics.
    9. Choose Add custom metrics.
    10. Create your new metric. For this example, we create a new custom metric for our RAG evaluation called information_comprehensiveness. This metric evaluates how thoroughly and completely the response addresses the query by using the retrieved information. It measures the extent to which the response extracts and incorporates relevant information from the retrieved passages to provide a comprehensive answer.
    11. You can choose between importing a JSON file, using a preconfigured template, or creating a custom metric with full configuration control. For example, you can select the preconfigured templates for the default metrics and change the scoring system or rubric. For our information_comprehensiveness metric, we select the custom option, which allows us to input our evaluator prompt directly.
    12. For Instructions, enter your prompt. For example:
      Your role is to evaluate how comprehensively the response addresses the query 
      using the retrieved information. Assess whether the response provides a thorough 
      treatment of the subject by effectively utilizing the available retrieved passages.
      
      Carefully evaluate the comprehensiveness of the RAG response for the given query
       against all specified criteria. Assign a single overall score that best represents
       the comprehensiveness, and provide a brief explanation justifying your rating, 
      referencing specific strengths and weaknesses observed.
      
      When evaluating response comprehensiveness, consider the following rubrics:
      - Coverage: Does the response utilize the key relevant information from the retrieved
       passages?
      - Depth: Does the response provide sufficient detail on important aspects from the
       retrieved information?
      - Context utilization: How effectively does the response leverage the available
       retrieved passages?
      - Information synthesis: Does the response combine retrieved information to create
       a thorough treatment?
      
      Evaluate the following:
      
      Query: {{prompt}}
      
      Retrieved passages: {{context}}
      
      Response to evaluate: {{prediction}}
    13. Enter your output schema to define how the custom metric results will be structured, visualized, normalized (if applicable), and explained by the model.

    If you use the built-in output schema (recommended), do not add your rating scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes so that your judge model results can be parsed. The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics.

    14. For Dataset and evaluation results S3 location, enter your input and output locations in Amazon S3.
    15. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose your role.
    16. Choose Create and wait for the job to complete.

    Start a RAG evaluation job with custom metrics using the Python SDK and APIs

    To use the Python SDK to create a RAG evaluation job with custom metrics, follow these steps (or refer to our example notebook):

    1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, IAM role with appropriate permissions, knowledge base ID, Amazon S3 paths for input data containing your inference responses, and output location for results:
      import boto3
      import time
      from datetime import datetime
      
      # Configure knowledge base and model settings
      knowledge_base_id = "<YOUR_KB_ID>"
      evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      generator_model = "amazon.nova-lite-v1:0"
      custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
      BUCKET_NAME = "<YOUR_BUCKET_NAME>"
      
      # Specify S3 locations
      input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
      output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
      
      # Configure retrieval settings
      num_results = 10
      search_type = "HYBRID"
      
      # Create Bedrock client
      # NOTE: You can change the region name to the region of your choosing
      bedrock_client = boto3.client('bedrock', region_name='us-east-1') 
    2. To define a custom metric for RAG evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}}, {{context}}, and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate responses consistently according to your specific criteria.
      # Define our custom information_comprehensiveness metric
      information_comprehensiveness_metric = {
          "customMetricDefinition": {
              "name": "information_comprehensiveness",
              "instructions": """
              Your role is to evaluate how comprehensively the response addresses the 
      query using the retrieved information. 
              Assess whether the response provides a thorough treatment of the subject
      by effectively utilizing the available retrieved passages.
      
      Carefully evaluate the comprehensiveness of the RAG response for the given query
      against all specified criteria. 
      Assign a single overall score that best represents the comprehensiveness, and 
      provide a brief explanation justifying your rating, referencing specific strengths
      and weaknesses observed.
      
      When evaluating response comprehensiveness, consider the following rubrics:
      - Coverage: Does the response utilize the key relevant information from the 
      retrieved passages?
      - Depth: Does the response provide sufficient detail on important aspects from 
      the retrieved information?
      - Context utilization: How effectively does the response leverage the available 
      retrieved passages?
      - Information synthesis: Does the response combine retrieved information to 
      create a thorough treatment?
      
      Evaluate using the following:
      
      Query: {{prompt}}
      
      Retrieved passages: {{context}}
      
      Response to evaluate: {{prediction}}
      """,
              "ratingScale": [
                  {
                      "definition": "Very comprehensive",
                      "value": {
                          "floatValue": 3
                      }
                  },
                  {
                      "definition": "Moderately comprehensive",
                      "value": {
                          "floatValue": 2
                      }
                  },
                  {
                      "definition": "Minimally comprehensive",
                      "value": {
                          "floatValue": 1
                      }
                  },
                  {
                      "definition": "Not at all comprehensive",
                      "value": {
                          "floatValue": 0
                      }
                  }
              ]
          }
      }
    3. To create a RAG evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your knowledge base ID, generator model, evaluator model, and proper Amazon S3 paths for input dataset and output results.
      # Create the evaluation job
      retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
      
      retrieve_generate_job = bedrock_client.create_evaluation_job(
          jobName=retrieve_generate_job_name,
          jobDescription="Evaluate retrieval and generation with custom metric",
          roleArn=role_arn,
          applicationType="RagEvaluation",
          inferenceConfig={
              "ragConfigs": [{
                  "knowledgeBaseConfig": {
                      "retrieveAndGenerateConfig": {
                          "type": "KNOWLEDGE_BASE",
                          "knowledgeBaseConfiguration": {
                              "knowledgeBaseId": knowledge_base_id,
                              "modelArn": generator_model,
                              "retrievalConfiguration": {
                                  "vectorSearchConfiguration": {
                                      "numberOfResults": num_results
                                  }
                              }
                          }
                      }
                  }
              }]
          },
          outputDataConfig={
              "s3Uri": output_path
          },
          evaluationConfig={
              "automated": {
                  "datasetMetricConfigs": [{
                      "taskType": "General",
                      "dataset": {
                          "name": "RagDataset",
                          "datasetLocation": {
                              "s3Uri": input_data
                          }
                      },
                      "metricNames": [
                          "Builtin.Correctness",
                          "Builtin.Completeness",
                          "Builtin.Helpfulness",
                          "information_comprehensiveness"
                      ]
                  }],
                  "evaluatorModelConfig": {
                      "bedrockEvaluatorModels": [{
                          "modelIdentifier": evaluator_model
                      }]
                  },
                  "customMetricConfig": {
                      "customMetrics": [
                          information_comprehensiveness_metric
                      ],
                      "evaluatorModelConfig": {
                          "bedrockEvaluatorModels": [{
                              "modelIdentifier": custom_metrics_evaluator_model
                          }]
                      }
                  }
              }
          }
      )
      
      print(f"Created evaluation job: {retrieve_generate_job_name}")
      print(f"Job ID: {retrieve_generate_job['jobArn']}")
    4. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the Amazon S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions including custom metrics.
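
    Once the job is complete, you can inspect the result files directly from Amazon S3. The following sketch assumes the job wrote JSONL result files under the configured output path and simply prints a preview of each record:

    import json
    import boto3

    # List and read the JSONL result files the evaluation job wrote to S3.
    # The key layout under the output prefix is managed by the service, so we
    # scan for .jsonl objects rather than assuming exact file names.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix="evaluation_output/"):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".jsonl"):
                continue
            body = s3.get_object(Bucket=BUCKET_NAME, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                record = json.loads(line)
                print(json.dumps(record, indent=2)[:500])  # preview each result record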

    Custom metrics are only available for LLM-as-a-judge. At the time of writing, we don’t accept custom AWS Lambda functions or endpoints for code-based custom metric evaluators. Human-based model evaluation has supported custom metric definition since its launch in November 2023.

    Clean up

    To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.

    Conclusion

    The addition of custom metrics to Amazon Bedrock Evaluations empowers organizations to define their own evaluation criteria for generative AI systems. By extending the LLM-as-a-judge framework with custom metrics, businesses can now measure what matters for their specific use cases alongside built-in metrics. With support for both numerical and categorical scoring systems, these custom metrics enable consistent assessment aligned with organizational standards and goals.

    As generative AI becomes increasingly integrated into business processes, the ability to evaluate outputs against custom-defined criteria is essential for maintaining quality and driving continuous improvement. We encourage you to explore these new capabilities through the Amazon Bedrock console and API examples provided, and discover how personalized evaluation frameworks can enhance your AI systems’ performance and business impact.


    About the Authors

    Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

    Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

    Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

    Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
