
    Use custom metrics to evaluate your generative AI application with Amazon Bedrock

    May 6, 2025

    With Amazon Bedrock Evaluations, you can evaluate foundation models (FMs) and Retrieval Augmented Generation (RAG) systems, whether they are hosted on Amazon Bedrock (including Amazon Bedrock Knowledge Bases) or elsewhere, such as multi-cloud or on-premises deployments. We recently announced the general availability of the large language model (LLM)-as-a-judge technique in model evaluation and the new RAG evaluation tool, which is also powered by LLM-as-a-judge behind the scenes. These tools are already empowering organizations to systematically evaluate FMs and RAG systems with enterprise-grade tooling. These evaluation tools aren’t limited to models or RAG systems hosted on Amazon Bedrock: with the bring your own inference (BYOI) responses feature, you can evaluate models or applications hosted anywhere, as long as you follow the input formatting requirements for the respective offering.

    The LLM-as-a-judge technique powering these evaluations enables automated, human-like evaluation quality at scale, using FMs to assess quality and responsible AI dimensions without manual intervention. With built-in metrics like correctness (factual accuracy), completeness (response thoroughness), faithfulness (hallucination detection), and responsible AI metrics such as harmfulness and answer refusal, you and your team can evaluate models hosted on Amazon Bedrock and knowledge bases natively, or using BYOI responses from your custom-built systems.

    Amazon Bedrock Evaluations offers an extensive list of built-in metrics for both evaluation tools, but there are times when you might want to define these evaluation metrics in a different way, or make completely new metrics that are relevant to your use case. For example, you might want to define a metric that evaluates an application response’s adherence to your specific brand voice, or want to classify responses according to a custom categorical rubric. You might want to use numerical scoring or categorical scoring for various purposes. For these reasons, you need a way to use custom metrics in your evaluations.

    Now with Amazon Bedrock, you can develop custom evaluation metrics for both model and RAG evaluations. This capability extends the LLM-as-a-judge framework that drives Amazon Bedrock Evaluations.

    In this post, we demonstrate how to use custom metrics in Amazon Bedrock Evaluations to measure and improve the performance of your generative AI applications according to your specific business requirements and evaluation criteria.

    Overview

    Custom metrics in Amazon Bedrock Evaluations offer the following features:

    • Simplified getting started experience – Pre-built starter templates, based on our industry-tested built-in metrics, are available on the AWS Management Console, with the option to create metrics from scratch for specific evaluation criteria.
    • Flexible scoring systems – Support is available for both quantitative (numerical) and qualitative (categorical) scoring, so you can create ordinal or nominal metrics, or even use the evaluation tools for classification tasks.
    • Streamlined workflow management – You can save custom metrics for reuse across multiple evaluation jobs or import previously defined metrics from JSON files.
    • Dynamic content integration – With built-in template variables (for example, {{prompt}}, {{prediction}}, and {{context}}), you can seamlessly inject dataset content and model outputs into evaluation prompts.
    • Customizable output control – You can use our recommended output schema for consistent results, with advanced options to define custom output formats for specialized use cases.

    Custom metrics give you unprecedented control over how you measure AI system performance, so you can align evaluations with your specific business requirements and use cases. Whether assessing factuality, coherence, helpfulness, or domain-specific criteria, custom metrics in Amazon Bedrock enable more meaningful and actionable evaluation insights.

    In the following sections, we walk through the steps to create a job with model evaluation and custom metrics using both the Amazon Bedrock console and the Python SDK and APIs.

    Supported data formats

    In this section, we review some important data formats.

    Judge prompt uploading

    To upload your previously saved custom metrics into an evaluation job, follow the JSON format in the following examples.

    The following code illustrates a definition with numerical scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
            "ratingScale": [
                {
                    "definition": "first rating definition",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "second rating definition",
                    "value": {
                        "floatValue": 2
                    }
                },
                {
                    "definition": "third rating definition",
                    "value": {
                        "floatValue": 1
                    }
                }
            ]
        }
    }

    The following code illustrates a definition with string scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
            "ratingScale": [
                {
                    "definition": "first rating definition",
                    "value": {
                        "stringValue": "first value"
                    }
                },
                {
                    "definition": "second rating definition",
                    "value": {
                        "stringValue": "second value"
                    }
                },
                {
                    "definition": "third rating definition",
                    "value": {
                        "stringValue": "third value"
                    }
                }
            ]
        }
    }

    The following code illustrates a definition with no scale:

    {
        "customMetricDefinition": {
            "metricName": "my_custom_metric",
            "instructions": "Your complete custom metric prompt including at least one {{input variable}}"
        }
    }

    For more information on defining a judge prompt with no scale, see the best practices section later in this post.

    Model evaluation dataset format

    When using LLM-as-a-judge, only one model can be evaluated per evaluation job. Consequently, you must provide a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLM-as-a-judge in model evaluation. Fields marked with ? are optional.

    {
        "prompt": string
        "referenceResponse"?: string
        "category"?: string
         "modelResponses"?: [
            {
                "response": string
                "modelIdentifier": string
            }
        ]
    }
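
    For example, a single record in the input JSONL file for a BYOI job might look like the following (the prompt, response, and model identifier here are placeholders for illustration):

    {"prompt": "What is Amazon S3?", "referenceResponse": "Amazon S3 is an object storage service.", "category": "aws-basics", "modelResponses": [{"response": "Amazon S3 is a scalable object storage service from AWS.", "modelIdentifier": "my-external-model"}]}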

    RAG evaluation dataset format

    We updated the evaluation job input dataset format to be even more flexible for RAG evaluation. Now, you can bring referenceContexts, which are expected retrieved passages, so you can compare your actual retrieved contexts to your expected retrieved contexts. You can find the new referenceContexts field in the updated JSONL schema for RAG evaluation:

    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": string
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": string
                }]
            }],
            "referenceContexts"?: [{
                "content": [{
                    "text": string
                }]
            }],
            "output": {
                "text": string,
                "modelIdentifier"?: string,
                "knowledgeBaseIdentifier": string,
                "retrievedPassages": {
                    "retrievalResults": [{
                        "name"?: string,
                        "content": {
                            "text": string
                        },
                        "metadata"?: {
                            [key: string]: string
                        }
                    }]
                }
            }
        }]
    }
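
    For example, a single BYOI record for a retrieve-and-generate evaluation might look like the following (the question, passages, and identifiers are placeholders for illustration):

    {"conversationTurns": [{"prompt": {"content": [{"text": "What is the refund window?"}]}, "referenceResponses": [{"content": [{"text": "Refunds are available within 30 days of purchase."}]}], "output": {"text": "You can request a refund within 30 days of purchase.", "knowledgeBaseIdentifier": "my-knowledge-base", "retrievedPassages": {"retrievalResults": [{"content": {"text": "Our policy allows refunds within 30 days of purchase."}}]}}}]}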

    Variables for data injection into judge prompts

    To make sure that your data is injected into the judge prompts in the right place, use the variables described below. For each variable, we also note where the evaluation tool pulls the data from your input file, if applicable. If you bring your own inference responses to the evaluation job, the service uses that data from your input file; if you don’t, it calls the Amazon Bedrock model or knowledge base and prepares the responses for you.

    The following variables are available for model evaluation:

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Response: variable {{prediction}}; input dataset JSONL key for a BYOI job: modelResponses.response; if you don’t bring your own inference responses, the evaluation job calls the model and prepares this data for you (mandatory)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key: referenceResponse (optional)

    The following variables are available for RAG evaluation (retrieve only):

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (optional)
    • Retrieved passage: variable {{context}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (mandatory)
    • Ground truth retrieved passage: variable {{reference_contexts}}; input dataset JSONL key: referenceContexts (optional)

    The following variables are available for RAG evaluation (retrieve and generate):

    • Prompt: variable {{prompt}}; input dataset JSONL key: prompt (optional)
    • Response: variable {{prediction}}; input dataset JSONL key for a BYOI job: output.text; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (mandatory)
    • Ground truth response: variable {{ground_truth}}; input dataset JSONL key: referenceResponses (optional)
    • Retrieved passage: variable {{context}}; input dataset JSONL key for a BYOI job: output.retrievedResults.retrievalResults; if you don’t bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you (optional)
    • Ground truth retrieved passage: variable {{reference_contexts}}; input dataset JSONL key: referenceContexts (optional)

    Prerequisites

    To use the LLM-as-a-judge model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

    • AWS account and model access:
      • An active AWS account
      • Selected evaluator and generator models enabled in Amazon Bedrock (verify on the Model access page of the Amazon Bedrock console)
      • Confirmed AWS Regions where the models are available and their quotas
    • AWS Identity and Access Management (IAM) and Amazon Simple Storage Service (Amazon S3) configuration:
      • Completed IAM setup and permissions for both model and RAG evaluation
      • Configured S3 bucket with appropriate permissions for accessing and writing output data
      • Enabled CORS on your S3 bucket
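
    If CORS isn’t already enabled on the bucket, the following is a minimal sketch using boto3 (the wildcard origin and the bucket name are placeholders for illustration; restrict AllowedOrigins for production use):

    import boto3

    # Apply a permissive CORS rule so the Amazon Bedrock console can read
    # evaluation artifacts from the bucket. Tighten AllowedOrigins in production.
    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket="<YOUR_BUCKET_NAME>",
        CORSConfiguration={
            "CORSRules": [
                {
                    "AllowedHeaders": ["*"],
                    "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                    "AllowedOrigins": ["*"],
                    "ExposeHeaders": ["Access-Control-Allow-Origin"],
                }
            ]
        },
    )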

    Create a model evaluation job with custom metrics using Amazon Bedrock Evaluations

    Complete the following steps to create a job with model evaluation and custom metrics using Amazon Bedrock Evaluations:

    1. On the Amazon Bedrock console, choose Evaluations in the navigation pane, then choose the Models tab.
    2. In the Model evaluation section, on the Create dropdown menu, choose Automatic: model as a judge.
    3. For the Model evaluation details, enter an evaluation name and optional description.
    4. For Evaluator model, choose the model you want to use for automatic evaluation.
    5. For Inference source, select the source and choose the model you want to evaluate.

    For this example, we chose Claude 3.5 Sonnet as the evaluator model, Bedrock models as our inference source, and Claude 3.5 Haiku as our model to evaluate.

    6. The console displays the default metrics for the evaluator model you chose. You can select other metrics as needed.
    7. In the Custom Metrics section, create a new metric called “Comprehensiveness.” Use the provided template and modify it for your metric. You can use the following variables to define the metric, where only {{prediction}} is mandatory:
      1. prompt
      2. prediction
      3. ground_truth

    The following is the metric we defined in full:

    Your role is to judge the comprehensiveness of an answer based on the question and
    the prediction. Assess the quality, accuracy, and helpfulness of the language model
    response, and use these to judge how comprehensive the response is. Award higher
    scores to responses that are detailed and thoughtful.
    
    Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
    against all specified criteria. Assign a single overall score that best represents the
    comprehensiveness, and provide a brief explanation justifying your rating, referencing
    specific strengths and weaknesses observed.
    
    When evaluating the response quality, consider the following rubrics:
    - Accuracy: Factual correctness of information provided
    - Completeness: Coverage of important aspects of the query
    - Clarity: Clear organization and presentation of information
    - Helpfulness: Practical utility of the response to the user
    
    Evaluate the following:
    
    Query:
    {{prompt}}
    
    Response to evaluate:
    {{prediction}}

    8. Create the output schema and additional metrics. Here, we define a scale that awards the maximum points (10) if the response is very comprehensive, and 1 if the response is not comprehensive at all.
    9. For Datasets, enter your input and output locations in Amazon S3.
    10. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose a role.
    11. Choose Create and wait for the job to complete.

    Considerations and best practices

    When using the output schema of the custom metrics, note the following:

    • If you use the built-in output schema (recommended), do not add your grading scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes. This is so the evaluation service can parse the judge model’s results and display them on the console in graphs and calculate average values of numerical scores.
    • The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics. Because judge LLMs are inherently stochastic, there might be some responses we can’t parse and display on the console and use in your average score calculations. However, the raw judge responses are always loaded into your S3 output file, even if the evaluation service cannot parse the response score from the judge model.
    • If you don’t use the built-in output schema feature (although we recommend using it), then you are responsible for providing your rating scale in the body of the judge prompt instructions. However, the evaluation service won’t add structured output instructions and won’t parse the results to show graphs; you will see the full plain-text judge output on the console without graphs, and the raw data will still be in your S3 bucket.

    Create a model evaluation job with custom metrics using the Python SDK and APIs

    To use the Python SDK to create a model evaluation job with custom metrics, follow these steps (or refer to our example notebook):

    1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, IAM role with appropriate permissions, Amazon S3 paths for input data containing your inference responses, and output location for results:
      import boto3
      import time
      from datetime import datetime
      
      # Configure model settings
      evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      generator_model = "amazon.nova-lite-v1:0"
      custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
      BUCKET_NAME = "<YOUR_BUCKET_NAME>"
      
      # Specify S3 locations
      input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
      output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
      
      # Create Bedrock client
      # NOTE: You can change the region name to the region of your choosing.
      bedrock_client = boto3.client('bedrock', region_name='us-east-1') 
    2. To define a custom metric for model evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}} and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate model outputs consistently according to your specific criteria.
      comprehensiveness_metric ={
          "customMetricDefinition": {
              "name": "comprehensiveness",
              "instructions": """Your role is to judge the comprehensiveness of an 
      answer based on the question and the prediction. Assess the quality, accuracy, 
      and helpfulness of language model response, and use these to judge how comprehensive
       the response is. Award higher scores to responses that are detailed and thoughtful.
      
      Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
       against all specified criteria. Assign a single overall score that best represents the 
      comprehensivenss, and provide a brief explanation justifying your rating, referencing 
      specific strengths and weaknesses observed.
      
      When evaluating the response quality, consider the following rubrics:
      - Accuracy: Factual correctness of information provided
      - Completeness: Coverage of important aspects of the query
      - Clarity: Clear organization and presentation of information
      - Helpfulness: Practical utility of the response to the user
      
      Evaluate the following:
      
      Query:
      {{prompt}}
      
      Response to evaluate:
      {{prediction}}""",
              "ratingScale": [
                  {
                      "definition": "Very comprehensive",
                      "value": {
                          "floatValue": 10
                      }
                  },
                  {
                      "definition": "Mildly comprehensive",
                      "value": {
                          "floatValue": 3
                      }
                  },
                  {
                      "definition": "Not at all comprehensive",
                      "value": {
                          "floatValue": 1
                      }
                  }
              ]
          }
      }
    3. To create a model evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (such as Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your generator model, evaluator model, and proper Amazon S3 paths for input dataset and output results.
      # Create the model evaluation job
      model_eval_job_name = f"model-evaluation-custom-metrics{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
      
      model_eval_job = bedrock_client.create_evaluation_job(
          jobName=model_eval_job_name,
          jobDescription="Evaluate model performance with custom comprehensiveness metric",
          roleArn=role_arn,
          applicationType="ModelEvaluation",
          inferenceConfig={
              "models": [{
                  "bedrockModel": {
                      "modelIdentifier": generator_model
                  }
              }]
          },
          outputDataConfig={
              "s3Uri": output_path
          },
          evaluationConfig={
              "automated": {
                  "datasetMetricConfigs": [{
                      "taskType": "General",
                      "dataset": {
                          "name": "ModelEvalDataset",
                          "datasetLocation": {
                              "s3Uri": input_data
                          }
                      },
                      "metricNames": [
                          "Builtin.Correctness",
                          "Builtin.Completeness",
                          "Builtin.Coherence",
                          "Builtin.Relevance",
                          "Builtin.FollowingInstructions",
                          "comprehensiveness"
                      ]
                  }],
                  "customMetricConfig": {
                      "customMetrics": [
                          comprehensiveness_metric
                      ],
                      "evaluatorModelConfig": {
                          "bedrockEvaluatorModels": [{
                              "modelIdentifier": custom_metrics_evaluator_model
                          }]
                      }
                  },
                  "evaluatorModelConfig": {
                      "bedrockEvaluatorModels": [{
                          "modelIdentifier": evaluator_model
                      }]
                  }
              }
          }
      )
      
      print(f"Created model evaluation job: {model_eval_job_name}")
      print(f"Job ID: {model_eval_job['jobArn']}")
    4. After submitting the evaluation job, monitor its status with get_evaluation_job and access results at your specified Amazon S3 location when complete, including the standard and custom metric performance data.
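
    The following is a minimal polling sketch under the same setup as the previous steps (the status value checked here and the sleep interval are assumptions for illustration):

    import time

    # Poll the evaluation job until it reaches a terminal state, then point
    # to the S3 output location configured when the job was created.
    job_arn = model_eval_job["jobArn"]
    while True:
        job = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
        status = job["status"]
        print(f"Job status: {status}")
        if status != "InProgress":
            break
        time.sleep(60)

    print(f"Final status: {status}. Results were written to {output_path}")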

    Create a RAG system evaluation with custom metrics using Amazon Bedrock Evaluations

    In this example, we walk through a RAG system evaluation with a combination of built-in metrics and custom evaluation metrics on the Amazon Bedrock console. Complete the following steps:

    1. On the Amazon Bedrock console, choose Evaluations in the navigation pane.
    2. On the RAG tab, choose Create.
    3. For the RAG evaluation details, enter an evaluation name and optional description.
    4. For Evaluator model, choose the model you want to use for automatic evaluation. The evaluator model selected here is used to calculate the default metrics, if you select any. For this example, we chose Claude 3.5 Sonnet as the evaluator model.
    5. Include any optional tags.
    6. For Inference source, select the source. Here, you have the option to select between Bedrock Knowledge Bases and Bring your own inference responses. If you’re using Amazon Bedrock Knowledge Bases, you will need to choose a previously created knowledge base or create a new one. For BYOI responses, you can bring the prompt dataset, context, and output from a RAG system. For this example, we chose Bedrock Knowledge Base as our inference source.
    7. Specify the evaluation type, response generator model, and built-in metrics. You can choose between a combined retrieval and response evaluation or a retrieval only evaluation, with options to use default metrics, custom metrics, or both for your RAG evaluation. The response generator model is only required when using an Amazon Bedrock knowledge base as the inference source. For the BYOI configuration, you can proceed without a response generator. For this example, we selected Retrieval and response generation as our evaluation type and chose Nova Lite 1.0 as our response generator model.
    8. In the Custom Metrics section, choose your evaluator model. We selected Claude 3.5 Sonnet v1 as our evaluator model for custom metrics.
    9. Choose Add custom metrics.
    10. Create your new metric. For this example, we create a new custom metric for our RAG evaluation called information_comprehensiveness. This metric evaluates how thoroughly and completely the response addresses the query by using the retrieved information. It measures the extent to which the response extracts and incorporates relevant information from the retrieved passages to provide a comprehensive answer.
    11. You can choose between importing a JSON file, using a preconfigured template, or creating a custom metric with full configuration control. For example, you can select the preconfigured templates for the default metrics and change the scoring system or rubric. For our information_comprehensiveness metric, we select the custom option, which allows us to input our evaluator prompt directly.
    12. For Instructions, enter your prompt. For example:
      Your role is to evaluate how comprehensively the response addresses the query 
      using the retrieved information. Assess whether the response provides a thorough 
      treatment of the subject by effectively utilizing the available retrieved passages.
      
      Carefully evaluate the comprehensiveness of the RAG response for the given query
       against all specified criteria. Assign a single overall score that best represents
       the comprehensiveness, and provide a brief explanation justifying your rating, 
      referencing specific strengths and weaknesses observed.
      
      When evaluating response comprehensiveness, consider the following rubrics:
      - Coverage: Does the response utilize the key relevant information from the retrieved
       passages?
      - Depth: Does the response provide sufficient detail on important aspects from the
       retrieved information?
      - Context utilization: How effectively does the response leverage the available
       retrieved passages?
      - Information synthesis: Does the response combine retrieved information to create
       a thorough treatment?
      
      Evaluate the following:
      
      Query: {{prompt}}
      
      Retrieved passages: {{context}}
      
      Response to evaluate: {{prediction}}
    13. Enter your output schema to define how the custom metric results will be structured, visualized, normalized (if applicable), and explained by the model.

    If you use the built-in output schema (recommended), do not add your rating scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes so that your judge model results can be parsed. The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics.

    14. For Dataset and evaluation results S3 location, enter your input and output locations in Amazon S3.
    15. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose your role.
    16. Choose Create and wait for the job to complete.

    Start a RAG evaluation job with custom metrics using the Python SDK and APIs

    To use the Python SDK to create a RAG evaluation job with custom metrics, follow these steps (or refer to our example notebook):

    1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, IAM role with appropriate permissions, knowledge base ID, Amazon S3 paths for input data containing your inference responses, and output location for results:
      import boto3
      import time
      from datetime import datetime
      
      # Configure knowledge base and model settings
      knowledge_base_id = "<YOUR_KB_ID>"
      evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      generator_model = "amazon.nova-lite-v1:0"
      custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
      role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
      BUCKET_NAME = "<YOUR_BUCKET_NAME>"
      
      # Specify S3 locations
      input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
      output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
      
      # Configure retrieval settings
      num_results = 10
      search_type = "HYBRID"
      
      # Create Bedrock client
      # NOTE: You can change the region name to the region of your choosing
      bedrock_client = boto3.client('bedrock', region_name='us-east-1') 
    2. To define a custom metric for RAG evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}}, {{context}}, and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate responses consistently according to your specific criteria.
      # Define our custom information_comprehensiveness metric
      information_comprehensiveness_metric = {
          "customMetricDefinition": {
              "name": "information_comprehensiveness",
              "instructions": """
              Your role is to evaluate how comprehensively the response addresses the 
      query using the retrieved information. 
              Assess whether the response provides a thorough treatment of the subject
      by effectively utilizing the available retrieved passages.
      
      Carefully evaluate the comprehensiveness of the RAG response for the given query
      against all specified criteria. 
      Assign a single overall score that best represents the comprehensiveness, and 
      provide a brief explanation justifying your rating, referencing specific strengths
      and weaknesses observed.
      
      When evaluating response comprehensiveness, consider the following rubrics:
      - Coverage: Does the response utilize the key relevant information from the 
      retrieved passages?
      - Depth: Does the response provide sufficient detail on important aspects from 
      the retrieved information?
      - Context utilization: How effectively does the response leverage the available 
      retrieved passages?
      - Information synthesis: Does the response combine retrieved information to 
      create a thorough treatment?
      
      Evaluate using the following:
      
      Query: {{prompt}}
      
      Retrieved passages: {{context}}
      
      Response to evaluate: {{prediction}}
      """,
              "ratingScale": [
                  {
                      "definition": "Very comprehensive",
                      "value": {
                          "floatValue": 3
                      }
                  },
                  {
                      "definition": "Moderately comprehensive",
                      "value": {
                          "floatValue": 2
                      }
                  },
                  {
                      "definition": "Minimally comprehensive",
                      "value": {
                          "floatValue": 1
                      }
                  },
                  {
                      "definition": "Not at all comprehensive",
                      "value": {
                          "floatValue": 0
                      }
                  }
              ]
          }
      }
    3. To create a RAG evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your knowledge base ID, generator model, evaluator model, and proper Amazon S3 paths for input dataset and output results.
      # Create the evaluation job
      retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
      
      retrieve_generate_job = bedrock_client.create_evaluation_job(
          jobName=retrieve_generate_job_name,
          jobDescription="Evaluate retrieval and generation with custom metric",
          roleArn=role_arn,
          applicationType="RagEvaluation",
          inferenceConfig={
              "ragConfigs": [{
                  "knowledgeBaseConfig": {
                      "retrieveAndGenerateConfig": {
                          "type": "KNOWLEDGE_BASE",
                          "knowledgeBaseConfiguration": {
                              "knowledgeBaseId": knowledge_base_id,
                              "modelArn": generator_model,
                              "retrievalConfiguration": {
                                  "vectorSearchConfiguration": {
                                      "numberOfResults": num_results
                                  }
                              }
                          }
                      }
                  }
              }]
          },
          outputDataConfig={
              "s3Uri": output_path
          },
          evaluationConfig={
              "automated": {
                  "datasetMetricConfigs": [{
                      "taskType": "General",
                      "dataset": {
                          "name": "RagDataset",
                          "datasetLocation": {
                              "s3Uri": input_data
                          }
                      },
                      "metricNames": [
                          "Builtin.Correctness",
                          "Builtin.Completeness",
                          "Builtin.Helpfulness",
                          "information_comprehensiveness"
                      ]
                  }],
                  "evaluatorModelConfig": {
                      "bedrockEvaluatorModels": [{
                          "modelIdentifier": evaluator_model
                      }]
                  },
                  "customMetricConfig": {
                      "customMetrics": [
                          information_comprehensiveness_metric
                      ],
                      "evaluatorModelConfig": {
                          "bedrockEvaluatorModels": [{
                              "modelIdentifier": custom_metrics_evaluator_model
                          }]
                      }
                  }
              }
          }
      )
      
      print(f"Created evaluation job: {retrieve_generate_job_name}")
      print(f"Job ID: {retrieve_generate_job['jobArn']}")
    4. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the Amazon S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions including custom metrics.
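
    Once the job is complete, you can inspect the result files directly from Amazon S3. The following sketch assumes the job wrote JSONL result files under the configured output path and simply prints a preview of each record:

    import json
    import boto3

    # List and read the JSONL result files the evaluation job wrote to S3.
    # The key layout under the output prefix is managed by the service, so we
    # scan for .jsonl objects rather than assuming exact file names.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix="evaluation_output/"):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".jsonl"):
                continue
            body = s3.get_object(Bucket=BUCKET_NAME, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                record = json.loads(line)
                print(json.dumps(record, indent=2)[:500])  # preview each result record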

    Custom metrics are only available for LLM-as-a-judge. At the time of writing, we don’t accept custom AWS Lambda functions or endpoints for code-based custom metric evaluators. Human-based model evaluation has supported custom metric definition since its launch in November 2023.

    Clean up

    To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.

    Conclusion

    The addition of custom metrics to Amazon Bedrock Evaluations empowers organizations to define their own evaluation criteria for generative AI systems. By extending the LLM-as-a-judge framework with custom metrics, businesses can now measure what matters for their specific use cases alongside built-in metrics. With support for both numerical and categorical scoring systems, these custom metrics enable consistent assessment aligned with organizational standards and goals.

    As generative AI becomes increasingly integrated into business processes, the ability to evaluate outputs against custom-defined criteria is essential for maintaining quality and driving continuous improvement. We encourage you to explore these new capabilities through the Amazon Bedrock console and API examples provided, and discover how personalized evaluation frameworks can enhance your AI systems’ performance and business impact.


    About the Authors

    Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

    Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

    Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

    Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
