
    Metadata filtering for tabular data with Knowledge Bases for Amazon Bedrock

    July 26, 2024

    Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation. However, data is often accompanied by additional descriptive information, called metadata. Without using metadata, the retrieval process can return unrelated results, decreasing FM accuracy and increasing the token cost of the FM prompt.

    On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering. This feature allows you to use metadata fields during the retrieval process. However, the metadata fields need to be configured during the knowledge base ingestion process. Often, you might have tabular data where details about one field are available in another field. Also, you could have a requirement to cite the exact text document or text field to prevent hallucination. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.

    Solution overview

    The solution consists of the following high-level steps:

    Prepare data for metadata filtering.
    Create and ingest data and metadata into the knowledge base.
    Retrieve data from the knowledge base using metadata filtering.

    Prepare data for metadata filtering

    As of this writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.

    For this post, we create a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.
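    The screenshot is not reproduced here. As an illustration, a few rows with the columns used later in this post might look like the following (a hypothetical sample constructed for this post, not the actual dataset contents):

    ```python
    import pandas as pd

    # Hypothetical rows mirroring the Food.com dataset columns used in this post
    df = pd.DataFrame({
        'Name': ['Berry Frozen Dessert', 'Buttermilk Pie'],
        'TotalTime': ['PT24H45M', 'PT1H5M'],           # ISO 8601 durations
        'CholesterolContent': [8.0, 170.0],
        'SugarContent': [30.2, 45.1],
        'RecipeInstructions': ['Toss berries with sugar...', 'Whisk eggs and buttermilk...'],
    })
    print(df.shape)  # (2, 5)
    ```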

    The TotalTime is in ISO 8601 format. You can convert that to minutes using the following logic:

    import re

    # Function to convert an ISO 8601 duration to minutes
    def convert_to_minutes(duration):
        hours = 0
        minutes = 0

        # Find hours and minutes using regex
        match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)

        if match:
            if match.group(1):
                hours = int(match.group(1))
            if match.group(2):
                minutes = int(match.group(2))

        # Convert total time to minutes
        total_minutes = hours * 60 + minutes
        return total_minutes

    df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)
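    As a quick sanity check, the converter can be exercised on a few sample durations (a self-contained restatement of the helper above):

    ```python
    import re

    def convert_to_minutes(duration):
        # Parse an ISO 8601 duration such as "PT1H30M" into total minutes
        hours = 0
        minutes = 0
        match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
        if match:
            if match.group(1):
                hours = int(match.group(1))
            if match.group(2):
                minutes = int(match.group(2))
        return hours * 60 + minutes

    print(convert_to_minutes('PT1H30M'))  # 90
    print(convert_to_minutes('PT45M'))    # 45
    print(convert_to_minutes('PT2H'))     # 120
    ```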

    After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.

    To enable the FM to point to a specific menu with a link (cite the document), we split each row of the tabular data into a separate text file, with each file containing RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata should be kept in a separate JSON file with the same name as the data file, with .metadata.json appended. For example, if the data file name is 100.txt, the metadata file name should be 100.txt.metadata.json. For more details, see Add metadata to your files to allow for filtering. Also, the content in the metadata file should be in the following format:

    {
        "metadataAttributes": {
            "${attribute1}": "${value1}",
            "${attribute2}": "${value2}",
            ...
        }
    }

    For the sake of simplicity, we only process the top 2,000 rows to create the knowledge base.

    After you import the necessary libraries, create a local directory using the following Python code:

    import pandas as pd
    import os, json, tqdm, boto3

    metafolder = 'multi_file_recipe_data'
    os.mkdir(metafolder)

    Iterate over the top 2,000 rows to create data and metadata files to store in the local folder:

    for i in tqdm.trange(2000):
        desc = str(df['RecipeInstructions'][i])
        meta = {
            "metadataAttributes": {
                "Name": str(df['Name'][i]),
                "TotalTimeInMinutes": str(df['TotalTimeInMinutes'][i]),
                "CholesterolContent": str(df['CholesterolContent'][i]),
                "SugarContent": str(df['SugarContent'][i]),
            }
        }
        filename = metafolder + '/' + str(i + 1) + '.txt'
        with open(filename, 'w') as f:
            f.write(desc)
        metafilename = filename + '.metadata.json'
        with open(metafilename, 'w') as f:
            json.dump(meta, f)

    Create an Amazon Simple Storage Service (Amazon S3) bucket named recipe-kb and upload the files:

    # Upload data to S3
    s3_client = boto3.client("s3")
    bucket_name = "recipe-kb"
    data_root = metafolder + '/'

    def uploadDirectory(path, bucket_name):
        for root, dirs, files in os.walk(path):
            for file in tqdm.tqdm(files):
                s3_client.upload_file(os.path.join(root, file), bucket_name, file)

    uploadDirectory(data_root, bucket_name)

    Create and ingest data and metadata into the knowledge base

    When the S3 folder is ready, you can create the knowledge base on the Amazon Bedrock console or with the SDK, following this example notebook.
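    For orientation, the shape of the configuration passed to the bedrock-agent create_knowledge_base API looks roughly like the following sketch. The ARNs, index name, and field names are placeholders for your own resources; the example notebook shows the complete setup, including the IAM role and OpenSearch Serverless collection.

    ```python
    # Sketch of the create_knowledge_base configuration; all ARNs and names
    # below are placeholders, not real resources.
    storage_configuration = {
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:111122223333:collection/EXAMPLE',
            'vectorIndexName': 'recipe-index',
            'fieldMapping': {
                'vectorField': 'vector',
                'textField': 'text',
                'metadataField': 'metadata',
            },
        },
    }
    knowledge_base_configuration = {
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1',
        },
    }
    # kb = bedrock_agent_client.create_knowledge_base(
    #     name='recipe-kb',
    #     roleArn=execution_role_arn,  # IAM role with S3 and AOSS permissions
    #     knowledgeBaseConfiguration=knowledge_base_configuration,
    #     storageConfiguration=storage_configuration,
    # )
    print(storage_configuration['type'])  # OPENSEARCH_SERVERLESS
    ```

    After the knowledge base exists, you create a data source pointing at the S3 bucket and start an ingestion job; during ingestion, each .metadata.json file is picked up alongside its matching data file.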

    Retrieve data from the knowledge base using metadata filtering

    Now let’s retrieve some data from the knowledge base. For this post, we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock as our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or on the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.

    Set the required Amazon Bedrock parameters using the following code:

    import boto3
    import pprint
    from botocore.client import Config
    import json

    pp = pprint.PrettyPrinter(indent=2)
    session = boto3.session.Session()
    region = session.region_name
    bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
    bedrock_client = boto3.client('bedrock-runtime', region_name=region)
    bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                                        config=bedrock_config, region_name=region)
    kb_id = "EIBBXVFDQP"
    model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

    # Retrieve API for fetching only the relevant context

    query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"

    relevant_documents = bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': query
        },
        knowledgeBaseId=kb_id,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 2
            }
        }
    )
    pp.pprint(relevant_documents["retrievalResults"])

    The following is the output of the retrieval from the knowledge base without metadata filtering for the query “Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10.” As we can see, out of the two recipes, the preparation durations are 30 and 480 minutes, respectively, and the cholesterol contents are 86 and 112.4, respectively. Therefore, the retrieval doesn’t follow the query accurately.

    The following code demonstrates how to use the Retrieve API with the metadata filters set to a cholesterol content less than 10 and minutes of preparation less than 30 for the same query:

    def retrieve(query, kbId, numberOfResults=5):
        return bedrock_agent_client.retrieve(
            retrievalQuery={
                'text': query
            },
            knowledgeBaseId=kbId,
            retrievalConfiguration={
                'vectorSearchConfiguration': {
                    'numberOfResults': numberOfResults,
                    "filter": {
                        'andAll': [
                            {
                                "lessThan": {
                                    "key": "CholesterolContent",
                                    "value": 10
                                }
                            },
                            {
                                "lessThan": {
                                    "key": "TotalTimeInMinutes",
                                    "value": 30
                                }
                            }
                        ]
                    }
                }
            }
        )

    query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
    response = retrieve(query, kb_id, 2)
    retrievalResults = response['retrievalResults']
    pp.pprint(retrievalResults)

    As we can see in the following results, out of the two recipes, the preparation times are 27 and 20 minutes, respectively, and the cholesterol contents are both 0. With metadata filtering, we get more accurate results.

    The following code shows how to get accurate output using the same metadata filtering with the retrieve_and_generate API. First, we set the prompt, then we set up the API with metadata filtering:

    prompt = f"""
    Human: You have great knowledge about food, so provide answers to questions using facts.
    If you don't know the answer, just say that you don't know; don't try to make up an answer.

    Assistant:"""

    def retrieve_and_generate(query, kb_id, model_id, numberOfResults=10):
        return bedrock_agent_client.retrieve_and_generate(
            input={
                'text': query,
            },
            retrieveAndGenerateConfiguration={
                'knowledgeBaseConfiguration': {
                    'generationConfiguration': {
                        'promptTemplate': {
                            'textPromptTemplate': f"{prompt} $search_results$"
                        }
                    },
                    'knowledgeBaseId': kb_id,
                    'modelArn': model_id,
                    'retrievalConfiguration': {
                        'vectorSearchConfiguration': {
                            'numberOfResults': numberOfResults,
                            'overrideSearchType': 'HYBRID',
                            "filter": {
                                'andAll': [
                                    {
                                        "lessThan": {
                                            "key": "CholesterolContent",
                                            "value": 10
                                        }
                                    },
                                    {
                                        "lessThan": {
                                            "key": "TotalTimeInMinutes",
                                            "value": 30
                                        }
                                    }
                                ]
                            },
                        }
                    }
                },
                'type': 'KNOWLEDGE_BASE'
            }
        )

    query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
    response = retrieve_and_generate(query, kb_id, model_id, numberOfResults=10)
    pp.pprint(response['output']['text'])

    As we can see in the following output, the model returns a detailed recipe that follows the instructed metadata filtering of less than 30 minutes of preparation time and a cholesterol content less than 10.

    Clean up

    Make sure to comment out the following section if you plan to use the knowledge base that you created for building your RAG application. If you only wanted to try out creating the knowledge base using the SDK, make sure to delete all the resources that were created, because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:

    bedrock_agent_client.delete_data_source(dataSourceId=ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
    bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
    oss_client.indices.delete(index=index_name)
    aoss_client.delete_collection(id=collection_id)
    aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
    aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
    aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])
    # Delete roles and policies
    iam_client.delete_role(RoleName=bedrock_kb_execution_role)
    iam_client.delete_policy(PolicyArn=policy_arn)

    Conclusion

    In this post, we explained how to split a large tabular dataset into rows to set up a knowledge base with metadata for each of those records, and how to then retrieve outputs with metadata filtering. We also showed how retrieving results with metadata is more accurate than retrieving results without metadata filtering. Lastly, we showed how to use the result with an FM to get accurate results.

    To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:

    Knowledge bases for Amazon Bedrock
    Amazon Bedrock Knowledge Base – Samples for building RAG workflows

    About the Author

    Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.
