This post is co-written with Kristina Olesova, Zdenko Estok, and Selimcan Sakar from Accenture.
In today’s data-driven world, organizations often face the challenge of extracting structured information from unstructured PDF documents. These PDFs can contain a myriad of elements, such as images, tables, headers, and text formatted in various styles, making it difficult to parse and analyze the data efficiently.
Additionally, the performance of chatbots and other natural language processing (NLP) applications depends heavily on the chunking strategy employed. Improper chunking can lead to loss of context, resulting in hallucinations and inaccurate responses. Chunk size also influences the performance of language models: smaller chunks provide more granular information but struggle with generalization, whereas larger chunks might miss important details.
This post explores how Accenture used the customization capabilities of Knowledge Bases for Amazon Bedrock to incorporate their data processing workflow and custom logic to create a custom chunking mechanism that enhances the performance of Retrieval Augmented Generation (RAG) and unlocks the potential of your PDF data.
Solution overview
The Accenture team created a knowledge base with the financial results of Accenture for every quarter from 2020–2024. These documents contained images, tables, text stored in different formats, and other noise elements.
In this use case, we wanted to extract granular information contained in the tables and also preserve the good generalization capabilities of foundation models (FMs) to respond to general questions about financial results.
After testing, we found that the search mechanism wasn’t able to correctly retrieve the information for the years and quarters specified in the prompt. The following screenshot shows an example where the query was for information from the first quarter of 2023, but the search mechanism returned information from the first quarter of 2020.
We couldn’t extract the correct chunk of data using different search strategies or by changing the number of retrieved chunks. After more rigorous testing, we identified problems with parsing the tabular information and retrieving the correct data. Because the issues were related to the inability of the search algorithm to select the correct chunks, we decided to change the chunking strategy and try the new features in Amazon Bedrock.
The architectural flow of the updated solution is as follows:
Begin by creating a data source with all the data stored in Amazon Simple Storage Service (Amazon S3) or another database. This can include custom PDFs with tables, forms, and other complex elements.
Run Amazon Textract on the PDFs stored in your data source. Amazon Textract is a highly accurate service that can extract text, tables, and other data from virtually any document.
Create chunks based on the extractions from paragraphs in the Amazon Textract output. For every chunk, include additional metadata such as chapter titles and document names to preserve context.
Embed the chunked files into vectors using the console for Knowledge Bases for Amazon Bedrock. Select no chunking while creating a vector representation of chunks.
Set up the system prompt, search strategies, number of chunks, and metadata filtering if applicable and ask the user for a question.
Use the vector search feature of Amazon OpenSearch Service to select the embedded chunks most similar to the user query (prompt).
Call an FM from Amazon Bedrock on the chunks provided by OpenSearch Service and get the answer.
The steps in the workflow are orchestrated using AWS Lambda, as shown in the following diagram.
The chunking mechanism uses Amazon Textract to detect paragraphs, tables, images, chapter titles, and other PDF layout elements to improve the chunking (without splitting the text in the middle of a sentence or paragraph), eliminate noise, and provide more context for metadata generation. We can use this metadata directly during filtration or as a hint in a prompt template to improve the accuracy of the generated response. Using the specified logic for every PDF element, we can take the correct actions depending on the category of the element.
The main PDF elements are as follows:
Tables – Tables are the most difficult layout elements in a PDF. The information can be correctly extracted only when headers and column names are correctly identified. This is difficult to achieve with fixed size chunking because there is no way to guarantee that headers will be present in the chunk, together with all the row information. We can use table detection to extract a table and save it in a CSV file, or even directly use it in a database as a data source for agents.
Images – If the text contains images connected to user instructions, the images can be detected and tagged during preprocessing. Later, these images can be stored in Amazon S3 and displayed in a chat window using relevant tags.
Page numbers, headers, and footers – This text doesn’t add valuable information for RAG models, and it can confuse them. Moreover, storing page headers and footers takes up space in the vector database and incurs cost with negligible benefit.
Chapter titles and subtitles – In many documents, chapter titles describe the context of the chapter. This information can help us tag the chunks using metadata, or directly include this information in the filtering process, thereby improving the accuracy and speed of extraction.
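The per-element logic described above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not the pipeline used in this post: it assumes each layout element has already been reduced to a (block_type, text) pair, where block_type is one of the layout labels that Amazon Textract returns (such as LAYOUT_TABLE or LAYOUT_HEADER). The function name route_elements and the output structure are hypothetical.

```python
# Textract layout labels that carry no value for retrieval.
NOISE = {"LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"}


def route_elements(elements):
    """Sort (block_type, text) pairs into text, tables, and figures.

    Hypothetical helper: noise is dropped, titles update the running
    chapter context, and every other element is tagged with that context.
    """
    routed = {"text": [], "tables": [], "figures": []}
    main_topic, secondary_topic = "", ""
    for block_type, text in elements:
        if block_type in NOISE:
            continue  # page numbers, headers, and footers are discarded
        if block_type == "LAYOUT_TITLE":
            main_topic, secondary_topic = text, ""  # new chapter resets the subtitle
            continue
        if block_type == "LAYOUT_SECTION_HEADER":
            secondary_topic = text
            continue
        item = {
            "text": text,
            "main_topic": main_topic,
            "secondary_topic": secondary_topic,
        }
        if block_type == "LAYOUT_TABLE":
            routed["tables"].append(item)
        elif block_type == "LAYOUT_FIGURE":
            routed["figures"].append(item)
        else:
            routed["text"].append(item)
    return routed
```

Keeping the chapter context as running state means that a table or paragraph is tagged with the most recent title and subtitle, even when those appeared on an earlier page.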
Use custom chunking with Knowledge Bases in Amazon Bedrock
In this section, we demonstrate how to use the proposed custom chunking solution.
Note: Keep in mind that the content and code provided are for informational purposes only. You should do an independent assessment before running anything in response to the information that follows.
This involves the following steps:
Specify the custom metadata for every financial document that you want to include in the analysis. For this post, we specified the information for quarter, fiscal year, company, and other fields:
metadata = {
    "metadataAttributes": {
        "document_name": document_name.split(".pdf")[0],
        "fiscal_year": fiscal_year,
        "quarter": quarter,
        "main_topic": "",
        "secondary_topic": " ",
        "format": "Text"
    }
}
Split the PDF files into multiple images or single PDF files. It’s important to have high resolution to properly distinguish all the characters within the files.
Invoke Amazon Textract to detect the layout items and table items:
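from PIL import Image
from textractor.data.constants import TextractFeatures

def textract_data(self, output):
    # Method of the processing class; self.extractor is assumed to be a
    # textractor.Textractor instance.
    image = Image.open(output)
    # Request layout and table detection for the page image.
    document = self.extractor.analyze_document(
        file_source=image,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        save_image=True
    )
    # Save the detected tables first, then the remaining text layout.
    new_layout = self.save_table(document)
    self.save_text(new_layout)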
Save the table information. In this example, we’re using Anthropic’s Claude models, which are able to correctly parse files in CSV format. Export all the tables detected as a CSV, and save the table names and specified table format as additional metadata:
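As a minimal sketch of this step, assuming each detected table is already available as a list of rows with the header row first, the export and the extra metadata could look like the following. The helper names and the table_name field are hypothetical and are not taken from the original pipeline.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (list of rows, header row first) to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def table_metadata(base_metadata, table_name):
    """Copy the document metadata and mark the chunk as a CSV table."""
    attributes = dict(base_metadata["metadataAttributes"])
    attributes["table_name"] = table_name  # hypothetical field name
    attributes["format"] = "CSV"
    return {"metadataAttributes": attributes}
```

Keeping the header row in the same CSV as the data rows is the point of the exercise: the model always sees column names together with the values they describe.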
Further processing is required for information other than tables and images. We create metadata tags containing the information about main chapter titles and subtitles. This information can help you boost performance using metadata filtering or during vector search using a system prompt. For every chunk of data, specify within the metadata to which chapter and subchapter it belongs. Ideally, you should always have one chunk of data for every subchapter, but this isn’t always possible. Many subchapters are too long to be parsed with one chunk. In such cases, you can split the text after the paragraph and use the same metadata for another chunk:
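The splitting rule can be illustrated with a short sketch. It assumes the paragraphs of one subchapter are already extracted as a list of strings and that chunk size is measured in characters; max_chars is an illustrative limit chosen for the example, not a value from Amazon Bedrock, and chunk_subchapter is a hypothetical name.

```python
def chunk_subchapter(paragraphs, metadata, max_chars=2000):
    """Group whole paragraphs into chunks of at most max_chars characters.

    Every chunk inherits the same chapter metadata. A single paragraph
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        # Account for the "\n\n" separator when the chunk already has content.
        extra = len(paragraph) + (2 if current else 0)
        if current and current_len + extra > max_chars:
            # Close the chunk at the paragraph boundary and start a new one.
            chunks.append({"text": "\n\n".join(current), "metadata": metadata})
            current, current_len = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        current_len += extra
    if current:
        chunks.append({"text": "\n\n".join(current), "metadata": metadata})
    return chunks
```

Because this sketch never cuts inside a paragraph, an unusually long paragraph can still produce an oversized chunk, so a real pipeline would need an additional rule for that case.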
The benefit of this method is that, even if the text continues on the next page, this mechanism is able to assign it to the correct chunk (as long as the chunk stays within the size limit of the embedding model). This helps prevent splitting the text in the middle of a sentence, which can often lead to hallucinations.
After the text is split, create two files for every chunk:
A .txt chunk file together with the metadata string.
A metadata.json file that can be used with the knowledge base metadata and filtering.
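Knowledge Bases for Amazon Bedrock pairs a source file with its metadata by name: the metadata file uses the source file name plus a .metadata.json suffix. The following sketch writes both files for one chunk. It assumes that "together with the metadata string" means the metadata is prepended to the chunk text; write_chunk_files and the chunk_id naming scheme are hypothetical.

```python
import json
from pathlib import Path


def write_chunk_files(chunk_text, metadata, out_dir, chunk_id):
    """Write <chunk_id>.txt and its <chunk_id>.txt.metadata.json sidecar."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / f"{chunk_id}.txt"
    meta_path = out_dir / f"{chunk_id}.txt.metadata.json"
    # Assumption: prepend the metadata string so the embedding sees the context.
    header = json.dumps(metadata["metadataAttributes"], ensure_ascii=False)
    text_path.write_text(f"{header}\n{chunk_text}", encoding="utf-8")
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return text_path, meta_path
```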
When the split is complete, upload the files to Amazon S3 and continue with creating the knowledge base using the no chunking option.
When using the custom chunk option, keep in mind the maximum size of possible chunks. If the text chunk is too large, the vectorization of the files will fail, and the file won’t be available for the knowledge base.
Benefits of custom chunking
Custom chunking offers the following benefits:
Context preservation – By chunking text based on chapters or subchapters, you can make sure that the context of each section remains relevant throughout the chunk, resulting in more accurate vector representations and reducing noise.
Flexible chunk sizes – Custom chunking allows you to dynamically adjust the chunk sizes, addressing the challenge of selecting the optimal chunk size for different use cases.
Improved retrieval performance – With custom chunking and the advanced retrieval capabilities of Amazon Bedrock such as metadata filtering, you can significantly enhance the performance of your retrieval frameworks, enabling faster and more accurate insights.
Seamless integration – Amazon Bedrock seamlessly integrates with other AWS services, such as Amazon S3 and Amazon Textract, providing a streamlined solution for data extraction, organization, and analysis.
Metadata filtering compared to system prompts
Metadata filtering is a powerful feature that significantly enhances the search algorithm’s performance. By using metadata filtering to specify fiscal years and quarters, we achieved notable improvements in response accuracy. Currently, the Amazon Bedrock console requires users to have prior knowledge of metadata filter names and their corresponding values. As of this writing, direct specification of these filters through prompts isn’t supported. Consequently, in practical applications, users would benefit from guidance or hints to assist them in selecting appropriate filter values.
The following figure shows an example of enabling metadata filtering for the same model and chunking logic. In the first question, using only the prompts, the search algorithm failed to provide chunks from the correct documents. In the second question, we filtered by fiscal year (2023) and quarter (Q3). The output of the search algorithm was just one chunk, but the correct one.
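Outside the console, the same filter can be passed programmatically. The following is a minimal sketch using the Retrieve API of the bedrock-agent-runtime client in boto3; it assumes the metadata keys fiscal_year and quarter defined earlier in this post. The function names are hypothetical, and the filter values must have the same type (number or string) as the values stored in the metadata files.

```python
def build_retrieval_config(fiscal_year, quarter, number_of_results=20):
    """Build a retrievalConfiguration that filters on fiscal year and quarter."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": number_of_results,
            "overrideSearchType": "HYBRID",
            "filter": {
                "andAll": [
                    {"equals": {"key": "fiscal_year", "value": fiscal_year}},
                    {"equals": {"key": "quarter", "value": quarter}},
                ]
            },
        }
    }


def retrieve_chunks(knowledge_base_id, question, fiscal_year, quarter):
    """Query the knowledge base; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the config builder works offline

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration=build_retrieval_config(fiscal_year, quarter),
    )
    return response["retrievalResults"]
```

In an application, the year and quarter could come from a dropdown or another guided input, which addresses the need for users to know the filter names and values in advance.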
Performance comparison
We compared fixed chunking, custom chunking, and custom chunking with prompts. For vectorization, we used the Amazon Titan Embeddings Text v1 model for custom chunking, baseline, and metadata filtering. We performed additional knowledge base testing with Cohere. We performed all the testing with Anthropic’s Claude 3 Sonnet model and hybrid search, with a maximum of 20 retrieved results.
We tested the performance of the models on several tasks:
Table information – Information only extractable from tables.
Long questions – Summarizing chapters using multiple chunks. This is a difficult task for models with a small embedding window.
Year-specific questions – The answers are very short and clear, but the correct extraction relies on the capability of vector search to determine the time span from the user question and extract the chunk corresponding to that time span.
We evaluated the performance manually by fact-checking the answers generated by the model against the source data. The following screenshots show some example questions and answers generated on two different knowledge bases for the year_sensitive class.
The first example uses custom chunking with an Amazon Titan Embeddings model.
The next example uses Cohere with fixed chunking.
We used the prompt template feature released in April 2024 to focus the model on detailed information regarding the fiscal years and quarters. This information was the same as it was in the metadata JSON file, and it gives the models some guidelines about what information is important for extracting the valid chunks. The following is an example of the system prompt:
The adjusted prompt template improved the accuracy of the results. For the knowledge base created with an Amazon Titan Embeddings model and fixed chunking, the accuracy of extracted results increased to 70 percent. This number served as a baseline for our evaluation.
After switching from fixed chunking to custom chunking with Amazon Titan, the accuracy of retrieved results increased by 17 percent.
Interestingly, Cohere achieved response accuracy similar to custom chunking, but its summaries (long answers) were slightly less detailed.
Summarization means condensing a long piece of text while retaining its essential information and meaning by capturing the main points, key ideas, and important details.
The following screenshots show some sample answers in the long answers category. The first example is the output from Cohere.
The following is the output using custom chunking.
Cohere uses smaller chunks of text for embedding, which makes it more precise, but it struggles to provide a detailed summary. The responses aren’t inaccurate, but they often miss important details and the generated answers are slightly ambiguous.
The biggest advantage of custom chunking is that saving the chunks with variable size helped us improve the accuracy of the model (compared to the original Amazon Titan Embeddings model). We also preserved the good summarization capabilities of the models by using bigger chunks when possible. Overall, the best performance was achieved using metadata filtering.
We applied metadata filtering only to the questions where it was applicable (where the user was asking about the specific year or quarter). It didn’t help in cases where the question was asking the model to extract information from multiple years (like the number of employees in every year or the revenue in every quarter). However, it’s still a great tool that can improve results significantly.
Clean up
As you conclude your journey through setting up and using the knowledge base in this post, it’s essential to clean up the resources you created, so your environment is clean and cost-efficient.
Decommission OpenSearch Service
First, you need to decommission OpenSearch Service. This process involves safely shutting down your OpenSearch instances to prevent any unintended data retention or unnecessary costs:
On the OpenSearch Service console, navigate to your domain.
Delete the domain and confirm the deletion when prompted.
Empty and delete the S3 bucket
Next, delete the S3 bucket that stored your data:
On the Amazon S3 console, navigate to your S3 bucket.
Delete the files to empty the bucket.
Delete the bucket, confirming the deletion when prompted to permanently remove the storage resource.
Delete the Lambda function
Finally, you need to delete the Lambda function created for this project:
On the Lambda console, select your function and choose Delete.
Confirm the deletion to remove the function and free up resources.
By following these steps, you have cleaned up the resources created during this post, maintaining a lean and cost-effective AWS environment. This not only helps in managing your resources better, but also makes sure that you’re only paying for what you use.
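The same cleanup can be scripted. The following sketch assumes boto3 and takes the AWS clients as parameters, so the deletion order is explicit; the function name and the resource names passed to it are placeholders. Buckets with versioning enabled would also need their object versions deleted, which this sketch does not cover.

```python
def clean_up(opensearch_client, s3_bucket, lambda_client, domain_name, function_name):
    """Delete the OpenSearch domain, the S3 bucket, and the Lambda function.

    Expects boto3 objects, for example:
      opensearch_client = boto3.client("opensearch")
      s3_bucket = boto3.resource("s3").Bucket("<your-bucket>")
      lambda_client = boto3.client("lambda")
    """
    opensearch_client.delete_domain(DomainName=domain_name)
    s3_bucket.objects.all().delete()  # a bucket must be empty before deletion
    s3_bucket.delete()
    lambda_client.delete_function(FunctionName=function_name)
```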
Conclusion
By combining the power of Knowledge Bases for Amazon Bedrock with custom chunking mechanisms and the advanced data extraction capabilities of Amazon Textract, organizations can unlock the true potential of their PDF data. Furthermore, using a knowledge base with custom chunking across different models lets you evaluate those models holistically and quickly. This solution helps you achieve accurate and contextual responses, improves the performance of retrieval frameworks, and enables efficient data extraction from unstructured PDF documents.
The joint effort between Accenture and AWS discussed in this post builds on the 15-year strategic relationship between the companies and uses the same proven mechanisms and accelerators built by the Accenture AWS Business Group (AABG). Connect with the AABG team at accentureaws@amazon.com to drive business outcomes by transforming to an intelligent data enterprise on AWS.
For more information about generative AI on AWS using Amazon Bedrock or Amazon SageMaker, we recommend the following resources:
Generative AI on AWS: Technology
Get started with generative AI on AWS using Amazon SageMaker JumpStart
You can also sign up for the AWS generative AI newsletter, which includes educational resources, blogs, and service updates.
Thank you for following along, and happy coding!
About the Authors
Kristina Olesova works as a Data Scientist at Accenture. She is focused primarily on computer vision and generative AI. Outside of work, she likes to read books and hike in the mountains.
Zdenko Estok works as a cloud architect and DevOps engineer at Accenture. He works with AABG to develop and implement innovative cloud solutions, and specializes in infrastructure as code and cloud security. Zdenko likes to bike to the office and enjoys pleasant walks in nature.
Selimcan “Can” Sakar is a cloud-first developer and solution architect at Accenture with a focus on artificial intelligence and a passion for watching models converge.
Shikhar Kwatra is a Sr. Partner Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partners in building strategic industry solutions on AWS.
Marcelo Silva is a Principal Product Manager at Amazon Web Services leading strategy and growth for Knowledge Bases for Amazon Bedrock and Amazon Lex. His passion is helping customers harness the power of conversational AI and generative AI solutions to drive business outcomes and growth.