
    Improve AI assistant response accuracy using Knowledge Bases for Amazon Bedrock and a reranking model

    August 7, 2024

    AI chatbots and virtual assistants have become increasingly popular in recent years thanks to breakthroughs in large language models (LLMs). Trained on large volumes of data, these models incorporate memory components in their architectural design, allowing them to understand textual context.

    The most common use cases for chatbot assistants focus on a few key areas: enhancing customer experiences, boosting employee productivity and creativity, and optimizing business processes, for instance through customer support, troubleshooting, and internal and external knowledge-base search.

    Despite these capabilities, a key challenge with chatbots is generating high-quality and accurate responses. One way of solving this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. Reranking seeks to improve search relevance by reordering the result set returned by a retriever with a different model. In this post, we explain how two techniques—RAG and reranking—can help improve chatbot responses using Knowledge Bases for Amazon Bedrock.

    Solution overview

    RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It works by first retrieving relevant content from a database, then using that content as context for the generative model to produce a final output. Using a RAG approach to build a chatbot has many advantages. For example, retrieving relevant content from the knowledge base before generating a response yields more relevant and coherent answers, which helps improve the conversational flow. RAG also scales better with more data compared to pure generative models, and it doesn’t require fine-tuning of the model when new data is added to the knowledge base. Additionally, the retrieval component enables the model to incorporate external knowledge by pulling relevant background information from its database. This approach helps provide factual, in-depth, and knowledgeable responses.

    To find an answer, RAG takes an approach that uses vector search across the documents. The advantage of vector search is speed and scalability. Rather than scanning every single document to find the answer, the RAG approach turns the texts (knowledge base) into embeddings and stores these embeddings in a database. The embeddings are a compressed representation of the documents, expressed as an array of numerical values. After the embeddings are stored, a vector search queries the vector database to find similar documents based on the vectors associated with them. Typically, a vector search returns the top k most relevant documents based on the user question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search doesn’t always return the most relevant information in the top k results. This directly impacts the accuracy of the response if the most relevant contexts aren’t available to the LLM.
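    The following minimal sketch illustrates the idea behind this step: embed the query and the documents, then rank documents by cosine similarity. The Titan embedding model ID is an assumption for illustration; substitute the embedding model configured for your knowledge base.

    import json

    import boto3
    import numpy as np

    bedrock_runtime = boto3.client("bedrock-runtime")

    def embed(text, model_id="amazon.titan-embed-text-v2:0"):
        # Convert a piece of text into an embedding vector with an Amazon Bedrock embedding model
        response = bedrock_runtime.invoke_model(
            modelId=model_id,
            body=json.dumps({"inputText": text}),
        )
        return np.array(json.loads(response["body"].read())["embedding"])

    def top_k_documents(query, documents, k=3):
        # Score every document against the query with cosine similarity and keep the k best
        q = embed(query)
        scored = []
        for doc in documents:
            d = embed(doc)
            scored.append((float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))), doc))
        return sorted(scored, key=lambda x: x[0], reverse=True)[:k]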

    Reranking is a technique that can further improve the responses by selecting the best option out of several candidate responses. The following architecture illustrates how a reranking solution could work.

    Architecture diagram for Reranking model integration with Knowledge Bases for Bedrock
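    As a local illustration of the reranking step (separate from the SageMaker deployment used later in this post), the following sketch scores each query/document pair with a cross-encoder from the Hugging Face Hub. It assumes the sentence-transformers package is installed and can download the bge-reranker-large model.

    from sentence_transformers import CrossEncoder

    # A cross-encoder reads the query and the candidate together, which typically
    # judges relevance more accurately than comparing precomputed embeddings
    reranker = CrossEncoder("BAAI/bge-reranker-large")

    def rerank(query, candidates, top_k=3):
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        return [doc for _, doc in ranked[:top_k]]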

    Let’s create a question answering solution, where we ingest The Great Gatsby, a 1925 novel by American writer F. Scott Fitzgerald. This book is publicly available through Project Gutenberg. We use Knowledge Bases for Amazon Bedrock to implement the end-to-end RAG workflow and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We then retrieve answers using standard RAG and a two-stage RAG that adds a reranking API, and compare the results from the two methods.

    The code sample is available in this GitHub repo.

    In the following sections, we walk through the high-level steps:

    Prepare the dataset.
    Generate questions from the document using an Amazon Bedrock LLM.
    Create a knowledge base that contains this book.
    Retrieve answers using the knowledge base retrieve API.
    Evaluate the response using the RAGAS framework.
    Retrieve answers again by running a two-stage RAG, using the knowledge base retrieve API and then applying reranking on the context.
    Evaluate the two-stage RAG response using the RAGAS framework.
    Compare the results and the performance of each RAG approach.

    For efficiency purposes, we provide sample code in a notebook that generates a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We highly recommend having a human validate each question and answer for accuracy.

    The following sections explain the major steps with the help of code blocks.

    Prerequisites

    To clone the GitHub repository to your local machine, open a terminal window and run the following commands:

    git clone https://github.com/aws-samples/amazon-bedrock-samples
    cd knowledge-bases/features-examples/03-advanced-concepts/reranking

    Prepare the dataset

    Download the book from the Project Gutenberg website. For this post, we create 10 large documents from this book and upload them to Amazon Simple Storage Service (Amazon S3):

    target_url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8"  # the great gatsby
    data = urllib.request.urlopen(target_url)
    my_texts = []
    for line in data:
        my_texts.append(line.decode())

    doc_size = 700  # size of the document to determine number of batches
    batches = math.ceil(len(my_texts) / doc_size)

    sagemaker_session = sagemaker.Session()
    default_bucket = sagemaker_session.default_bucket()
    s3_prefix = "bedrock/knowledgebase/datasource"

    start = 0
    s3 = boto3.client("s3")
    for batch in range(batches):
        batch_text_arr = my_texts[start:start + doc_size]
        batch_text = "".join(batch_text_arr)
        s3.put_object(
            Body=batch_text,
            Bucket=default_bucket,
            Key=f"{s3_prefix}/{start}.txt"
        )
        start += doc_size
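    Optionally, you can confirm that the batched documents landed in Amazon S3 before moving on:

    response = s3.list_objects_v2(Bucket=default_bucket, Prefix=s3_prefix)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])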

    Create a knowledge base for Amazon Bedrock

    If you’re new to using Knowledge Bases for Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, where we described how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.

    In this step, you create a knowledge base using a Boto3 client. You use Amazon Titan Text Embeddings V2 to convert the documents into embeddings (embeddingModelArn) and point to the S3 bucket you created earlier as the data source (dataSourceConfiguration):

    bedrock_agent = boto3.client("bedrock-agent")
    response = bedrock_agent.create_knowledge_base(
        name=knowledge_base_name,
        description="Knowledge Base for Bedrock",
        roleArn=role_arn,
        knowledgeBaseConfiguration={
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embedding_model_arn
            }
        },
        storageConfiguration={
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": {
                "collectionArn": collection_arn,
                "vectorIndexName": index_name,
                "fieldMapping": {
                    "vectorField": "bedrock-knowledge-base-default-vector",
                    "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
                    "metadataField": "AMAZON_BEDROCK_METADATA"
                }
            }
        }
    )
    knowledge_base_id = response["knowledgeBase"]["knowledgeBaseId"]
    knowledge_base_name = response["knowledgeBase"]["name"]

    response = bedrock_agent.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name=f"{knowledge_base_name}-ds",
        dataSourceConfiguration={
            "type": "S3",
            "s3Configuration": {
                "bucketArn": f"arn:aws:s3:::{bucket}",
                "inclusionPrefixes": [
                    f"{s3_prefix}/",
                ]
            }
        },
        vectorIngestionConfiguration={
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {
                    "maxTokens": 300,
                    "overlapPercentage": 10
                }
            }
        }
    )
    data_source_id = response["dataSource"]["dataSourceId"]

    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
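    The ingestion job runs asynchronously. A simple poll such as the following sketch waits for the job to finish before you query the knowledge base:

    import time

    ingestion_job_id = response["ingestionJob"]["ingestionJobId"]
    while True:
        job = bedrock_agent.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]
        # COMPLETE and FAILED are terminal states; otherwise keep waiting
        if job["status"] in ("COMPLETE", "FAILED"):
            print("Ingestion job finished with status:", job["status"])
            break
        time.sleep(30)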

    Generate questions from the document

    We use Anthropic Claude on Amazon Bedrock to generate a list of 10 questions and the corresponding answers. The Q&A data serves as the foundation for evaluating the RAG approaches that we are going to implement. We define the generated answers from this step as ground truth data. See the following code:

    prompt_template = """The question should be diverse in nature
    across the document. The question should not contain options, and should not start with Q1/Q2.
    Restrict the question to the context information provided.

    <document>
    {{document}}
    </document>

    Think step by step and pay attention to the number of questions to create.

    Your response should follow the format below:

    Question: question
    Answer: answer

    """
    system_prompt = """You are a professor. Your task is to set up 1 question for an upcoming
    quiz/examination based on the given document wrapped in <document></document> XML tags."""

    prompt = prompt_template.replace("{{document}}", documents)
    temperature = 0.9
    top_k = 250
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    # Base inference parameters to use.
    inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
    # Additional inference parameters to use.
    additional_model_fields = {"top_k": top_k}

    # Send the message.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        system=[{"text": system_prompt}],
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )
    print(response["output"]["message"]["content"][0]["text"])
    result = response["output"]["message"]["content"][0]["text"]
    q_pos = [(a.start(), a.end()) for a in list(re.finditer("Question:", result))]
    a_pos = [(a.start(), a.end()) for a in list(re.finditer("Answer:", result))]
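    One way to turn these matched positions into question and answer pairs, assuming the model followed the Question:/Answer: format exactly, is to slice the response text between the markers:

    questions = []
    ground_truths = []
    for (q_start, q_end), (a_start, a_end) in zip(q_pos, a_pos):
        # The question text sits between "Question:" and the following "Answer:" marker
        questions.append(result[q_end:a_start].strip())
        # The answer runs to the next "Question:" marker, or to the end of the text
        next_q = [s for s, _ in q_pos if s > a_end]
        answer_end = next_q[0] if next_q else len(result)
        ground_truths.append(result[a_end:answer_end].strip())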

    Retrieve answers using the knowledge base APIs

    We use the generated questions and retrieve answers from the knowledge base using the retrieve and converse APIs:

    contexts = []
    answers = []

    for question in questions:
        response = agent_runtime.retrieve(
            knowledgeBaseId=knowledge_base_id,
            retrievalQuery={
                "text": question
            },
            retrievalConfiguration={
                "vectorSearchConfiguration": {
                    "numberOfResults": topk
                }
            }
        )

        retrieval_results = response["retrievalResults"]
        local_contexts = []
        for result in retrieval_results:
            local_contexts.append(result["content"]["text"])
        contexts.append(local_contexts)
        combined_docs = "\n".join(local_contexts)
        prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
        prompt = prompt.replace("{{query}}", question)
        temperature = 0.9
        top_k = 250
        messages = [{"role": "user", "content": [{"text": prompt}]}]
        # Base inference parameters to use.
        inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
        # Additional inference parameters to use.
        additional_model_fields = {"top_k": top_k}

        # Send the message.
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=inference_config,
            additionalModelRequestFields=additional_model_fields
        )
        answers.append(response["output"]["message"]["content"][0]["text"])

    Evaluate the RAG response using the RAGAS framework

    We now evaluate the effectiveness of the RAG approach using a framework called RAGAS. The framework provides a suite of metrics to evaluate different dimensions. In our example, we evaluate responses based on the following dimensions:

    Answer relevancy – This metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0–1, where higher scores indicate better relevancy.
    Answer similarity – This assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0–1. A higher score signifies a better alignment between the generated answer and the ground truth.
    Context relevancy – This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of 0–1, with higher values indicating better relevancy.
    Answer correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0–1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

    A summarized report for the standard RAG approach based on the RAGAS evaluation:

    answer_relevancy: 0.9006225160334027

    answer_similarity: 0.7400904157096762

    answer_correctness: 0.32703043056663855

    context_relevancy: 0.024797687553157175
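    For reference, a report like this can be produced with a call along the following lines. This is a sketch; metric and column names vary across RAGAS versions, so adjust the imports and dataset fields to the version you install.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        answer_similarity,
        answer_correctness,
        context_relevancy,
    )

    eval_dataset = Dataset.from_dict({
        "question": questions,          # questions generated from the document
        "answer": answers,              # responses produced by the RAG pipeline
        "contexts": contexts,           # retrieved passages per question
        "ground_truth": ground_truths,  # reference answers treated as ground truth
    })

    scores = evaluate(
        eval_dataset,
        metrics=[answer_relevancy, answer_similarity, answer_correctness, context_relevancy],
    )
    print(scores)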

    Two-stage RAG: Retrieve and rerank

    Now that you have the results from the standard RAG approach, let’s explore the two-stage retrieval approach by extending standard RAG to integrate with a reranking model. In the context of RAG, reranking models are used after an initial set of contexts is retrieved by the retriever. The reranking model takes in the list of results and reranks each one based on the similarity between the context and the user query. In our example, we use a powerful reranking model called bge-reranker-large. The model is available on the Hugging Face Hub and is free for commercial use. In the following code, we use the knowledge base’s retrieve API so we can get a handle on the retrieved context, and we rerank it using the reranking model deployed as an Amazon SageMaker endpoint. We provide the sample code for deploying the reranking model in SageMaker in the GitHub repository. Here’s a code snippet that demonstrates the two-stage retrieval process:

    def generate_two_stage_context_answers(bedrock_runtime,
                                           agent_runtime,
                                           model_id,
                                           knowledge_base_id,
                                           retrieval_topk,
                                           reranking_model,
                                           questions,
                                           rerank_top_k=3):
        contexts = []
        answers = []
        predictor = Predictor(endpoint_name=reranking_model,
                              serializer=JSONSerializer(),
                              deserializer=JSONDeserializer())
        for question in questions:
            retrieval_results = two_stage_retrieval(agent_runtime, knowledge_base_id, question,
                                                    retrieval_topk, predictor, rerank_top_k)
            local_contexts = []
            for result in retrieval_results:
                local_contexts.append(result)

            contexts.append(local_contexts)
            combined_docs = "\n".join(local_contexts)
            prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
            prompt = prompt.replace("{{query}}", question)
            temperature = 0.9
            top_k = 250
            messages = [{"role": "user", "content": [{"text": prompt}]}]
            inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
            additional_model_fields = {"top_k": top_k}

            response = bedrock_runtime.converse(
                modelId=model_id,
                messages=messages,
                inferenceConfig=inference_config,
                additionalModelRequestFields=additional_model_fields
            )
            answers.append(response["output"]["message"]["content"][0]["text"])
        return contexts, answers
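    The two_stage_retrieval helper referenced above lives in the accompanying repository; a simplified sketch of what it might look like follows. The request payload for the reranking endpoint is an assumption here; adapt it to the input and output format of your deployed bge-reranker-large endpoint.

    def two_stage_retrieval(agent_runtime, knowledge_base_id, question,
                            retrieval_topk, predictor, rerank_top_k):
        # Stage 1: retrieve a broad candidate set from the knowledge base
        response = agent_runtime.retrieve(
            knowledgeBaseId=knowledge_base_id,
            retrievalQuery={"text": question},
            retrievalConfiguration={
                "vectorSearchConfiguration": {"numberOfResults": retrieval_topk}
            },
        )
        candidates = [r["content"]["text"] for r in response["retrievalResults"]]

        # Stage 2: score each (question, candidate) pair with the reranking endpoint.
        # The payload shape and returned scores are assumptions; some deployments
        # return a list of {"label", "score"} dicts instead of raw scores.
        payload = {"inputs": [{"text": question, "text_pair": doc} for doc in candidates]}
        scores = predictor.predict(payload)

        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        return [doc for _, doc in ranked[:rerank_top_k]]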

    Evaluate the two-stage RAG response using the RAGAS framework

    We evaluate the answers generated by the two-stage retrieval process. The following is a summarized report based on RAGAS evaluation:

    answer_relevancy: 0.841581671275458

    answer_similarity: 0.7961827348349313

    answer_correctness: 0.43361356731293665

    context_relevancy: 0.06049484724216884

    Compare the results

    Let’s compare the results from our tests. As shown in the following figure, the reranking API improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process.

    RAG vs Two Stage Retrieval evaluation metrics

    Similarly, we also measured the RAG latency for both approaches. The results are shown in the following metrics and the corresponding chart:

    Standard RAG latency: 76.59s

    Two Stage Retrieval latency: 312.12s

    Latency metric for RAG and Two Stage Retrieval process
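    One simple way to measure these latencies is to time each end-to-end run, for example:

    import time

    start = time.perf_counter()
    contexts, answers = generate_two_stage_context_answers(
        bedrock_runtime, agent_runtime, model_id, knowledge_base_id,
        retrieval_topk, reranking_model, questions)
    print(f"Two-stage retrieval latency for {len(questions)} questions: {time.perf_counter() - start:.2f}s")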

    In summary, using a reranking model (bge-reranker-large) hosted on an ml.m5.xlarge instance yields approximately four times the latency of the standard RAG approach. We recommend testing with different reranking model variants and instance types to obtain the optimal performance for your use case.

    Conclusion

    In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with Knowledge Bases for Amazon Bedrock can provide better performance. Finally, we used RAGAS, an open source framework, to provide context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.

    Try out this retrieval process today, and share your feedback in the comments.

    About the Author

    Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

    Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and gives prescriptive guidance to achieve their objectives with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling, and hiking.

    Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

    Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
