
    Talk to your slide deck using multimodal foundation models on Amazon Bedrock – Part 3

    December 17, 2024

    In this series, we share two approaches for gaining insights from multimodal data such as text, images, and charts. In Part 1, we presented an “embed first, infer later” solution that uses the Amazon Titan Multimodal Embeddings foundation model (FM) to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. Part 1 uses AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless.

    In Part 2, we demonstrated a different approach: “infer first, embed later.” We used Anthropic’s Claude 3 Sonnet on Amazon Bedrock to generate text descriptions for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. Then we used Anthropic’s Claude 3 Sonnet to generate answers to user questions based on the most relevant text description retrieved from the vector database.

    In this post, we evaluate the results from both approaches using ground truth provided by SlideVQA[1], an open source visual question answering dataset. You can test both approaches and evaluate the results to find the best fit for your datasets. The code for this series is available in the GitHub repo.

    Comparison of approaches

    SlideVQA is a collection of publicly available slide decks, each composed of multiple slides (in JPG format), with questions based on the information in the slide decks. The task requires a system to select a set of evidence images and answer the question. We use SlideVQA as the single source of truth to compare the results. It’s important that you follow the Amazon Bedrock data protection policies when using public datasets.

    This post follows the process depicted in the following diagram. For more details about the architecture, refer to the solution overview and design in Parts 1 and 2 of the series.

    Process flow

    We selected 100 random questions from SlideVQA to create a sample dataset to test solutions from Part 1 and Part 2.
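    As an illustration, the following sketch shows one way such a sample could be drawn. The annotation file name and field names are assumptions for illustration; SlideVQA’s actual distribution format may differ.

    import json
    import random

    # Hypothetical local copy of the SlideVQA question annotations (JSON Lines);
    # the file name and field names below are assumptions for illustration.
    with open("slidevqa_annotations.jsonl") as f:
        records = [json.loads(line) for line in f]

    random.seed(42)  # fix the seed so the 100-question sample is reproducible
    sample = random.sample(records, 100)

    # Keep only the fields needed for evaluation.
    dataset = [{"question": r["question"], "answer": r["answer"]} for r in sample]

    with open("sample_dataset.json", "w") as f:
        json.dump(dataset, f, indent=2)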

    The responses to the questions in the sample dataset are as concise as possible, as shown in the following example:

    "question": "What is the process by which the breaking of hydrogen bonds allows water to change from the liquid phase into the gaseous phase which has reached equilibrium with the liquid surface said to have done?"
    
    "answer": "reached saturation"

    The responses from large language models (LLMs) are quite verbose:

    According to the information provided in the images, the process by which the breaking of hydrogen bonds allows water to change from the liquid phase into the gaseous phase that has reached equilibrium with the liquid surface is said to have reached saturation.
    
    The key points are:
    
    1. Evaporation involves the breaking of hydrogen bonds that hold water molecules together in the liquid phase, allowing them to transition into the gaseous (vapor) phase.
    
    2. Only the fastest moving water molecules with enough energy can overcome the hydrogen bonding and evaporate into the vapor phase.
    
    3. The evaporation process that has reached equilibrium with the liquid surface, where the vapor pressure is balanced with the evaporation rate, is described as having reached saturation.
    
    So in summary, the breaking of hydrogen bonds provides the mechanism for water molecules to gain enough energy to escape the liquid phase as vapor, and when this vapor has reached equilibrium with the liquid surface, it is said to have reached saturation.

    We updated the prompts in each approach to request short responses instead of verbose ones. This helped match the output length to the ground truth responses in the sample dataset.

    The following sections briefly discuss the solutions and dive into the evaluation and pricing for each approach.

    Approach 1: Embed first, infer later

    Slide decks are converted into images, one per slide, and each image is embedded using the Amazon Titan Multimodal Embeddings model, resulting in a 1,024-dimension vector embedding. The embeddings are ingested through an Amazon OpenSearch Ingestion pipeline into an OpenSearch Serverless index, which serves as the vector store for our Retrieval Augmented Generation (RAG) solution.
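    The following is a minimal sketch of embedding a single slide image with the Amazon Titan Multimodal Embeddings model through the Amazon Bedrock runtime. The file name is hypothetical, and the request and response shapes reflect the Titan Multimodal Embeddings API as we understand it.

    import base64
    import json

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Read one slide image and base64-encode it for the embeddings request.
    with open("slide_01.jpg", "rb") as f:  # hypothetical slide image file
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({
            "inputImage": image_b64,
            "embeddingConfig": {"outputEmbeddingLength": 1024},
        }),
        accept="application/json",
        contentType="application/json",
    )
    embedding = json.loads(response["body"].read())["embedding"]  # 1,024-dimension vector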

    Each question is converted into embeddings using the Amazon Titan Multimodal Embeddings model, and an OpenSearch vector search is performed using these embeddings. We performed a k-nearest neighbor (k-NN) search to retrieve the most relevant embedding matching the question. The metadata of the response from the OpenSearch index contains a path to the image corresponding to the most relevant slide.
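    A minimal sketch of that k-NN query against the OpenSearch Serverless collection is shown below. The collection endpoint, index name, and field names (vector_embedding, image_path) are assumptions for illustration; question_embedding is produced the same way as the slide embeddings above, passing the question as inputText instead of inputImage.

    import boto3
    from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

    # Connection details for the OpenSearch Serverless collection (assumed values).
    credentials = boto3.Session().get_credentials()
    auth = AWSV4SignerAuth(credentials, "us-east-1", "aoss")
    client = OpenSearch(
        hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
        http_auth=auth,
        use_ssl=True,
        connection_class=RequestsHttpConnection,
    )

    # question_embedding is the Titan Multimodal Embeddings vector for the question text.
    knn_query = {
        "size": 1,
        "query": {
            "knn": {
                "vector_embedding": {  # assumed vector field name in the index
                    "vector": question_embedding,
                    "k": 1,
                }
            }
        },
        "_source": ["image_path"],  # assumed metadata field holding the slide image path
    }
    results = client.search(index="slide-embeddings", body=knn_query)
    image_path = results["hits"]["hits"][0]["_source"]["image_path"]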

    The following prompt is combined with the image at the retrieved path and sent to Anthropic’s Claude 3 Sonnet to respond to the question with a concise answer:

    Human: Your role is to provide a precise answer to the question in the <question></question> tags. Search the image provided to answer the question. Retrieve the most accurate answer in as few words as possible. Do not make up an answer. For questions that ask for numbers, follow the instructions below in the <instructions></instructions> tags. Skip the preamble and provide only the exact precise answer.
    
    If the image does not contain the answer to the question below, then respond with two words only - "no answer".
    
    Refer to the question and instructions below:
    
    <question>
    {question}
    </question>
    
    
    <instructions>
    1. Search for relevant data and numbers in the charts and graphs present in the image.
    
    2. If the image does not provide a direct answer to the user question, just say "no answer". Do not add statements like "The image does not provide..." and "It only mentions...", instead just respond with "no answer".
    
    3. Do not add any tags in your answer.
    
    4. Scan for the direct answer to the user question. If there is more than one direct answer, give everything that seems like a valid answer to the question in your response.
    
    5. Search for the question deeply in the image. If the question asks about any data or statistics, look for it in charts, tables, graphs first, and then in texts. Check the headings in the image.
    
    </instructions>
    
    If the image does not contain the answer, or if image does not directly answer the user question, do not respond with "The image does not provide..." or anything similar. In this case, your response should always be "no answer" and nothing else.
    
    Assistant: Here is my response to the question. I will give a direct and precise answer to the question if I find it and if not, I will say "no answer":

    For this evaluation, we used Anthropic’s Claude 3 Sonnet instead of the LLaVA 1.5-7b model used in Part 1. The approach remains the same, “embed first, infer later,” but the model that generates the final response is changed for simplicity and comparability between the two approaches.
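    A minimal sketch of sending the prompt and the retrieved slide image to Anthropic’s Claude 3 Sonnet through the Bedrock Messages API could look like the following. Here, prompt holds the template above with the question filled in, image_path comes from the OpenSearch result, and the Human/Assistant framing is expressed as a user message.

    import base64
    import json

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    # image_path comes from the OpenSearch result; prompt is the filled-in template above.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,  # keep answers short to match the ground truth style
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    })

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )
    answer = json.loads(response["body"].read())["content"][0]["text"]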

    A response for each question in the dataset is recorded in JSON format and compared to the ground truth provided by SlideVQA.

    On the 100-question dataset, this approach retrieved a response for 78% of the questions and achieved 50% accuracy on the final responses.
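    The post doesn’t spell out the exact matching rule, so the following is only a sketch of how the comparison could be scored, assuming a normalized substring match between each model response and the SlideVQA answer, and a hypothetical responses_approach1.json file with question, answer, and response fields.

    import json
    import re

    def normalize(text):
        # Lowercase and strip punctuation and extra whitespace before comparing.
        return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

    with open("responses_approach1.json") as f:  # hypothetical results file
        results = json.load(f)

    answered = [r for r in results if normalize(r["response"]) != "no answer"]
    correct = [r for r in answered
               if normalize(r["answer"]) in normalize(r["response"])
               or normalize(r["response"]) in normalize(r["answer"])]

    print(f"Retrieval rate: {len(answered) / len(results):.0%}")
    print(f"Accuracy: {len(correct) / len(results):.0%}")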

    Approach 2: Infer first, embed later

    Slide decks are converted into images, one per slide, and each image is passed to Anthropic’s Claude 3 Sonnet to generate a text description. The descriptions are sent to the Amazon Titan Text Embeddings model to generate 1,536-dimension vector embeddings, which are ingested into an OpenSearch Serverless index using an OpenSearch Ingestion pipeline.
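    A minimal sketch of embedding one slide description with the Amazon Titan Text Embeddings model is shown below; slide_description stands for the text that Claude 3 Sonnet generated for a slide.

    import json

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    # slide_description is the text description Claude 3 Sonnet generated for one slide.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": slide_description}),
        accept="application/json",
        contentType="application/json",
    )
    embedding = json.loads(response["body"].read())["embedding"]  # 1,536-dimension vector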

    Each question is converted into embeddings using the Amazon Titan Text Embeddings model and an OpenSearch vector search is performed using these embeddings. We performed a k-NN search to retrieve the most relevant embedding matching the question. The metadata of the response from the OpenSearch index contains the image description corresponding to the most relevant slide.

    We create a prompt with the question and image description and pass it to Anthropic’s Claude 3 Sonnet to receive a precise answer. The following is the prompt template:

    Human: Your role is to provide a precise answer to the question in the <question></question> tags. Search the summary provided in the <summary></summary> tags to answer the question. Retrieve the most accurate answer in as few words as possible. Do not make up an answer. For questions that ask for numbers, follow the instructions below in the <instructions></instructions> tags. Skip the preamble and provide only the exact precise answer.
    
    If the summary does not contain the answer to the question below, then respond with two words only - "no answer".
    
    Refer to the question, summary, and instructions below:
    
    <question>
    {question}
    </question>
    
    <summary>
    {summary}
    </summary>
    
    <instructions>
    1. Search for relevant data and numbers in the summary.
    
    2. If the summary does not provide a direct answer to the user question, just say "no answer". Do not add statements like "The summary does not specify..." and "I do not have enough information...", instead just respond with "no answer".
    
    3. Do not add any tags in your answer.
    
    4. Scan for the direct answer to the user question. If there is more than one direct answer, give everything that seems like a valid answer to the question in your response.
    
    </instructions>
    
    If the summary does not contain the answer, or if summary does not directly answer the user question, do not respond with "The summary does not provide..." or anything similar. In this case, your response should always be "no answer" and nothing else.
    
    Assistant: Here is my response to the question. I will give a direct and precise answer to the question if I find it and if not, I will say "no answer":

    The response for each question in the dataset is recorded in JSON format for ease of comparison. The response is compared to the ground truth provided by SlideVQA.

    With this approach, 75% of the 100 questions in the sample dataset retrieved a response, and we achieved 44% accuracy on the final responses.

    Analysis of results

    In our testing, both approaches matched the ground truth for 50% or fewer of the questions in the sample dataset. The sample dataset contains a random selection of slide decks covering a wide variety of topics, including retail, healthcare, academic, technology, personal, and travel. Therefore, for a generic question like “What are examples of tools that can be used?”, which lacks additional context, the nearest match could retrieve responses from a variety of topics, leading to inaccurate results, especially when all embeddings are ingested into the same OpenSearch index. Techniques such as hybrid search, pre-filtering based on metadata, and reranking can improve retrieval accuracy.

    One of the solutions is to retrieve more results (increase the k value) and reorder them to keep the most relevant ones; this technique is called reranking. We share additional ways to improve the accuracy of the results later in this post.
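    The post doesn’t prescribe a specific reranker, but as a sketch, one could retrieve the top few candidates from the k-NN search and re-score them against the question with an open source cross-encoder (a hypothetical choice, shown here with the sentence-transformers library):

    from sentence_transformers import CrossEncoder  # hypothetical choice of reranker

    # candidates: the top-k hits from the k-NN search (k > 1), each with a text description.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, c["description"]) for c in candidates])
    best = candidates[int(scores.argmax())]  # keep only the most relevant candidate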

    The final prompts to Anthropic’s Claude 3 Sonnet in our analysis included instructions to provide a concise answer in as few words as possible to be able to compare with the ground truth. Your responses will depend on your prompts to the LLM.

    Pricing

    Pricing depends on the modality, provider, and model used. For more details, refer to Amazon Bedrock pricing. We use the On-Demand and Batch pricing modes in our analysis, which allow you to use FMs on a pay-as-you-go basis without making time-based term commitments. For text-generation models, you are charged for every input token processed and every output token generated. For embeddings models, you are charged for every input token processed.

    The following tables show the price per question for each approach. We calculated the average number of input and output tokens based on our sample dataset for the us-east-1 AWS Region; pricing may vary based on your datasets and Region used.

    You can use the following tables for guidance. Refer to the Amazon Bedrock pricing website for additional information.

    Approach 1

    Model | Description | Input price (per 1,000 tokens or per input image) | Input tokens | Input cost | Output price (per 1,000 tokens) | Output tokens | Output cost
    Amazon Titan Multimodal Embeddings | Slide/image embedding | $0.00006 | 1 | $0.00000006 | $0.000 | 0 | $0.00000
    Amazon Titan Multimodal Embeddings | Question embedding | $0.00080 | 20 | $0.00001600 | $0.000 | 0 | $0.00000
    Anthropic’s Claude 3 Sonnet | Final response | $0.00300 | 700 | $0.00210000 | $0.015 | 8 | $0.00012
    Cost per input/output | | | | $0.00211606 | | | $0.00012
    Total cost per question | | | | | | | $0.00224

    Approach 2

    Model | Description | Input price (per 1,000 tokens) | Input tokens | Input cost | Output price (per 1,000 tokens) | Output tokens | Output cost
    Anthropic’s Claude 3 Sonnet | Slide/image description | $0.00300 | 4,523 | $0.01356900 | $0.015 | 350 | $0.00525
    Amazon Titan Text Embeddings | Slide/image description embedding | $0.00010 | 350 | $0.00003500 | $0.000 | 0 | $0.00000
    Amazon Titan Text Embeddings | Question embedding | $0.00010 | 20 | $0.00000200 | $0.000 | 0 | $0.00000
    Anthropic’s Claude 3 Sonnet | Final response | $0.00300 | 700 | $0.00210000 | $0.015 | 8 | $0.00012
    Cost per input/output | | | | $0.01570600 | | | $0.00537
    Total cost per question | | | | | | | $0.02108
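    As a quick check of the Approach 1 numbers, the per-question total can be reproduced from the prices and token counts in the table above:

    # Approach 1, following the table: price per 1,000 tokens x number of tokens / 1,000.
    slide_embedding    = 0.00006 / 1000 * 1    # Titan Multimodal, slide/image embedding
    question_embedding = 0.00080 / 1000 * 20   # Titan Multimodal, question embedding
    claude_input       = 0.00300 / 1000 * 700  # Claude 3 Sonnet, final-response input
    claude_output      = 0.01500 / 1000 * 8    # Claude 3 Sonnet, final-response output

    total = slide_embedding + question_embedding + claude_input + claude_output
    print(f"${total:.5f} per question")  # about $0.00224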

    Clean up

    To avoid incurring charges, delete any resources from Parts 1 and 2 of the solution. You can do this by deleting the stacks using the AWS CloudFormation console.

    Conclusion

    In Parts 1 and 2 of this series, we explored ways to use the power of multimodal FMs such as Amazon Titan Multimodal Embeddings, Amazon Titan Text Embeddings, and Anthropic’s Claude 3 Sonnet. In this post, we compared the approaches from an accuracy and pricing perspective.

    Code for all parts of the series is available in the GitHub repo. We encourage you to deploy both approaches and explore different Anthropic Claude models available on Amazon Bedrock. You can discover new information and uncover new perspectives using your organization’s slide content with either approach. Compare the two approaches to identify a better workflow for your slide decks.

    With generative AI rapidly developing, there are several ways to improve the results and approach the problem. We are exploring performing a hybrid search and adding search filters by extracting entities from the question to improve the results. Part 4 in this series will explore these concepts in detail.

    Portions of this code are released under the Apache 2.0 License.

    Resources

    [1] Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. 2023. “SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images.” Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13636–13645. https://doi.org/10.1609/aaai.v37i11.26598


    About the Authors

    Archana Inapudi is a Senior Solutions Architect at AWS, supporting a strategic customer. She has over a decade of cross-industry expertise leading strategic technical initiatives. Archana is an aspiring member of the AI/ML technical field community at AWS. Prior to joining AWS, Archana led a migration from traditional siloed data sources to Hadoop at a healthcare company. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes.

    Manju Prasad is a Senior Solutions Architect at Amazon Web Services. She focuses on providing technical guidance in a variety of technical domains, including AI/ML. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup. She has worked in all layers of the software stack, ranging from webdev to databases, and has experience in all levels of the software development lifecycle. She is passionate about sharing knowledge and fostering interest in emerging talent.

    Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.

    Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.
