
    Tailoring foundation models for your business needs: A comprehensive guide to RAG, fine-tuning, and hybrid approaches

    May 28, 2025

    Foundation models (FMs) have revolutionized AI capabilities, but adopting them for specific business needs can be challenging. Organizations often struggle with balancing model performance, cost-efficiency, and the need for domain-specific knowledge. This blog post explores three powerful techniques for tailoring FMs to your unique requirements: Retrieval Augmented Generation (RAG), fine-tuning, and a hybrid approach combining both methods. We dive into the advantages, limitations, and ideal use cases for each strategy.

    AWS provides a suite of services and features to simplify the implementation of these techniques. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Knowledge Bases provides native support for RAG, streamlining the process of enhancing model outputs with domain-specific information. Amazon Bedrock also offers native features for model customization through continued pre-training and fine-tuning. In addition, you can use Amazon Bedrock Custom Model Import to bring and use your customized models alongside existing FMs through a single serverless, unified API. You can also use Amazon Bedrock Model Distillation to create smaller, faster, more cost-effective models that deliver use-case-specific accuracy comparable to the most advanced models in Amazon Bedrock.

    For this post, we used Amazon SageMaker AI for the fine-tuning and hybrid approaches to maintain more control over the fine-tuning script and to try different fine-tuning methods. We used Amazon Bedrock Knowledge Bases for the RAG approach, as shown in Figure 1.

    To help you make informed decisions, we provide ready-to-use code in our GitHub repo that uses these AWS services to experiment with the RAG, fine-tuning, and hybrid approaches. You can evaluate their performance on your specific use case and dataset, and choose the approach that best fits to effectively customize FMs for your business needs.

    Figure 1: Architecture diagram for RAG, fine-tuning and hybrid approaches

    Retrieval Augmented Generation

    RAG is a cost-effective way to enhance AI capabilities by connecting existing models to external knowledge sources. For example, an AI-powered customer service chatbot using RAG can answer questions about current product features by first checking the product documentation knowledge base. When a customer asks a question, the system retrieves the relevant details from the product knowledge base before composing its response, helping to make sure that the information is accurate and up to date.

    A RAG approach gives AI models access to external knowledge sources for better responses and has two main steps: retrieval, which finds the relevant information in connected data sources, and generation, in which an FM composes an answer based on the retrieved information.
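
    As a minimal illustration of these two steps, the following sketch uses the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API, which performs retrieval and generation in a single call. The knowledge base ID and model ARN below are hypothetical placeholders, not values from this post's repository.

    import boto3

    # Hypothetical identifiers -- replace with your own knowledge base ID and model ARN.
    KB_ID = "YOUR_KNOWLEDGE_BASE_ID"
    MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/meta.llama3-1-8b-instruct-v1:0"

    client = boto3.client("bedrock-agent-runtime")

    # One call runs both RAG steps: retrieve relevant chunks from the knowledge base,
    # then generate an answer grounded in those chunks.
    response = client.retrieve_and_generate(
        input={"text": "What are the safety instructions for product X?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KB_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    print(response["output"]["text"])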

    Fine-tuning

    Fine-tuning is a powerful way to customize FMs for specific tasks or domains using additional training data. In fine-tuning, you adjust the model’s parameters using a smaller, labeled dataset relevant to the target domain.

    For example, to build an AI-powered customer service chatbot, you can fine-tune an existing FM on your own dataset to handle questions about a company’s product features. By training the model on historical customer interactions and product specifications, the fine-tuned model learns the context and the company’s messaging tone, so it can provide more accurate responses.

    If the company launches a new product, the model should be fine-tuned again with new data to update its knowledge and maintain relevance. Fine-tuning helps make sure that the model can deliver precise, context-aware responses. However, it requires more computational resources and time compared to RAG, because the model itself needs to be retrained with the new data.

    Hybrid approach

    The hybrid approach combines the strengths of RAG and fine-tuning to deliver highly accurate, context-aware responses. Consider an example: a company frequently updates the features of its products. It wants to customize its FM using internal data, but keeping the model updated with changes in the product catalog is challenging. Because product features change monthly, keeping the model up to date through fine-tuning alone would be costly and time-consuming.

    By adopting a hybrid approach, the company can reduce costs and improve efficiency. It can fine-tune the model every couple of months to keep it aligned with the company’s overall tone, while RAG retrieves the latest product information from the company’s knowledge base, helping to make sure that responses are up to date. Fine-tuning the model also enhances RAG’s performance during the generation phase, leading to more coherent and contextually relevant responses. If you want to further improve the retrieval phase, you can customize the embedding model, use a different search algorithm, or explore other retrieval optimization techniques.

    The following sections provide the background for dataset creation and the implementation of the three approaches.

    Prerequisites

    To deploy the solution, you need:

    • An AWS account. If you don’t already have one, you can create an AWS account.
    • IAM permissions: your access to the AWS account must include AWS Identity and Access Management (IAM) permissions to launch AWS CloudFormation templates that create IAM roles.
    • The AWS Command Line Interface (AWS CLI) installed.
    • Docker installed.
    • The AWS Cloud Development Kit (AWS CDK) installed. See Getting started with the AWS CDK.

    Dataset description

    For the proof-of-concept, we created two synthetic datasets using Anthropic’s Claude 3 Sonnet on Amazon Bedrock.

    Product catalog dataset

    This dataset is your primary knowledge source in Amazon Bedrock. We created a product catalog that consists of 15 fictitious manufacturing products by prompting Anthropic’s Claude 3 Sonnet with example product catalogs. You should create your dataset in .txt format. The catalog used in this post has the following fields (a hypothetical entry is sketched after the list):

    • Product names
    • Product descriptions
    • Safety instructions
    • Configuration manuals
    • Operation instructions
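
    For illustration, a single (entirely fictitious) entry in such a .txt catalog might look like the following; your own catalog can use any plain-text layout that captures these fields.

    Product name: HydraPress 3000 benchtop hydraulic press
    Product description: A 30-ton benchtop hydraulic press for light manufacturing and prototyping workshops.
    Safety instructions: Wear eye protection and keep hands clear of the press bed while the ram is moving.
    Configuration manual: Mount the press on a level surface and torque the frame bolts to the specified value before first use.
    Operation instructions: Place the workpiece on the bed, pump the handle until the target pressure is reached, then release the valve slowly.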

    Training and test datasets

    We use the same product catalog we created for the RAG approach as training data to run domain adaptation fine-tuning.

    The test dataset consists of question-and-answer pairs about the product catalog dataset created earlier. We used the code in the Question-Answer Dataset section of the Jupyter notebook to generate the test dataset.

    Implementation

    We implemented three different approaches: RAG, fine-tuning, and hybrid. See the Readme file for instructions to deploy the whole solution.

    RAG

    The RAG approach uses Amazon Bedrock Knowledge Bases and consists of two main parts.

    To set up the infrastructure:

    1. Update the config file with your required data (details in the Readme).
    2. Run the following commands in the infrastructure folder:
    cd infrastructure
    ./prepare.sh
    cdk bootstrap aws://<<ACCOUNT_ID>>/<<REGION>>
    cdk synth
    cdk deploy --all

    Context retrieval and response generation:

    1. The system finds relevant information by searching the knowledge base with the user’s question.
    2. It then sends both the user’s question and the retrieved information to the Meta Llama 3.1 8B model on Amazon Bedrock.
    3. The LLM then generates a response based on the user’s question and the retrieved information, as outlined in the sketch after this list.
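
    The following is a minimal sketch of these two steps with boto3, assuming a hypothetical knowledge base ID; the retrieved chunks are stitched into a prompt and sent to Meta Llama 3.1 8B through the Bedrock InvokeModel API. The deployed solution organizes this logic differently, so treat this as an outline rather than the repository’s implementation.

    import json
    import boto3

    KB_ID = "YOUR_KNOWLEDGE_BASE_ID"            # hypothetical knowledge base ID
    MODEL_ID = "meta.llama3-1-8b-instruct-v1:0"

    agent_runtime = boto3.client("bedrock-agent-runtime")
    bedrock_runtime = boto3.client("bedrock-runtime")

    question = "What are the operation instructions for the hydraulic press?"

    # Step 1: retrieve the most relevant chunks from the knowledge base.
    retrieval = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
    )
    context = "\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])

    # Step 2: send the question plus retrieved context to Llama 3.1 8B on Amazon Bedrock.
    prompt = (
        "Answer the customer question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.2}),
    )
    print(json.loads(response["body"].read())["generation"])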

    Fine-tuning

    We used Amazon SageMaker AI JumpStart to fine-tune the Meta Llama 3.1 8B Instruct model with the domain adaptation method for five epochs. You can adjust the following parameters in the config.py file (a hypothetical JumpStart sketch follows the list):

    • Fine-tuning method: You can change the fine-tuning method in the config file; the default is domain_adaptation.
    • Number of epochs: Adjust the number of epochs in the config file according to your data size.
    • Fine-tuning template: Change the template based on your use case. The current one prompts the LLM to answer a customer question.
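
    The following is a rough sketch of the equivalent SageMaker JumpStart fine-tuning call, assuming a hypothetical S3 path for the training data; the JumpStart model ID and hyperparameter names follow the public JumpStart Llama examples and should be checked against the values in config.py.

    from sagemaker.jumpstart.estimator import JumpStartEstimator

    # Hypothetical S3 location of the product catalog .txt file used for domain adaptation.
    train_data_s3 = "s3://your-bucket/fine-tuning/train/"

    estimator = JumpStartEstimator(
        model_id="meta-textgeneration-llama-3-1-8b-instruct",  # JumpStart ID; confirm for your SDK version
        instance_type="ml.g5.12xlarge",
        environment={"accept_eula": "true"},  # Meta Llama models require accepting the EULA
    )
    # Domain adaptation continues pre-training on raw text; this post used 5 epochs.
    estimator.set_hyperparameters(instruction_tuned="False", epoch="5")
    estimator.fit({"training": train_data_s3})

    # Deploy the fine-tuned model to a real-time endpoint for the fine-tuning and hybrid approaches.
    predictor = estimator.deploy()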

    Hybrid

    The hybrid approach combines RAG and fine-tuning, and uses the following high-level steps:

    1. Retrieve the most relevant context for the user’s question from the knowledge base.
    2. Generate the answer with the fine-tuned model, using the retrieved context.

    You can customize the prompt template in the config.py file.
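
    A condensed sketch of the hybrid flow follows, assuming the knowledge base from the RAG approach and a SageMaker endpoint hosting the fine-tuned model; the endpoint name and payload format are illustrative, not the exact ones used by the solution.

    import json
    import boto3

    KB_ID = "YOUR_KNOWLEDGE_BASE_ID"              # hypothetical knowledge base ID
    ENDPOINT_NAME = "llama-3-1-8b-finetuned"      # hypothetical SageMaker endpoint name

    agent_runtime = boto3.client("bedrock-agent-runtime")
    sm_runtime = boto3.client("sagemaker-runtime")

    question = "What maintenance does the hydraulic press need?"

    # Step 1: retrieve context from the knowledge base, exactly as in the RAG approach.
    retrieval = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID, retrievalQuery={"text": question}
    )
    context = "\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])

    # Step 2: let the fine-tuned model on SageMaker generate the answer from the retrieved context.
    payload = {
        "inputs": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        "parameters": {"max_new_tokens": 512, "temperature": 0.2},
    }
    response = sm_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(json.loads(response["Body"].read()))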

    Evaluation

    For this example, we use three evaluation metrics to measure performance. You can modify src/evaluation.py to implement your own metrics.

    Each metric helps you understand different aspects of how well each of the approaches works:

    • BERTScore: BERTScore measures how similar the generated answers are to the reference answers using cosine similarity between token embeddings. It reports precision, recall, and F1; we used the F1 measure as the evaluation score.
    • LLM evaluator score: We use different language models from Amazon Bedrock to score the responses from the RAG, fine-tuning, and hybrid approaches. Each evaluator receives both the reference answer and the generated answer and gives a score between 0 and 1 (closer to 1 indicates higher similarity) for each generated answer. We then calculate the final score by averaging all the evaluation scores. The process is shown in the following figure, and a minimal scoring sketch follows it.

    Figure 2: LLM evaluator method
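
    The following is a minimal scoring sketch for these two metrics, assuming lists of generated and reference answers; it uses the open source bert-score package and an Amazon Bedrock judge model. The judge prompt and model ID are illustrative and are not the exact ones in src/evaluation.py.

    import boto3
    from bert_score import score

    generated = ["The press supports up to 30 tons."]    # model outputs (illustrative)
    references = ["The press has a 30-ton capacity."]    # ground-truth answers (illustrative)

    # BERTScore: F1 of token-level cosine similarities between generated and reference answers.
    _, _, f1 = score(generated, references, lang="en")
    print("Average BERTScore F1:", f1.mean().item())

    # LLM evaluator: ask a Bedrock model to rate similarity between 0 and 1.
    bedrock = boto3.client("bedrock-runtime")
    prompt = (
        "Rate from 0 to 1 how well the candidate answer matches the reference answer. "
        "Reply with only the number.\n"
        f"Reference: {references[0]}\nCandidate: {generated[0]}"
    )
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    print("LLM evaluator score:", resp["output"]["message"]["content"][0]["text"])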

    • Inference latency: Response times are important in applications like chatbots, so depending on your use case, this metric might weigh heavily in your decision. For each approach, we averaged the time it took to receive a full response for each sample.
    • Cost analysis: To do a full cost analysis, we made the following assumptions:
      • We used one OpenSearch compute unit (OCU) for indexing and another for search over the documents indexed for RAG. See OpenSearch Serverless pricing for more details.
      • We assume an application that has 1,000 users, each making 10 requests per day with an average of 2,000 input tokens and 1,000 output tokens. See Amazon Bedrock pricing for more details.
      • We used an ml.g5.12xlarge instance for fine-tuning and for hosting the fine-tuned model. The fine-tuning job took 15 minutes to complete. See SageMaker AI pricing for more details.
      • For the fine-tuning and hybrid approaches, we assume that the model endpoint is up 24/7, which might vary according to your use case.
      • The cost calculation is done for one month.

    Based on those assumptions, the cost associated with each of the three approaches is calculated as follows:

    • For RAG:
      • OpenSearch Serverless monthly cost = cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
      • Total Meta Llama 3.1 8B invocation cost = 1,000 users * 10 requests per day * (price per input token * 2,000 + price per output token * 1,000) * 30 days
    • For fine-tuning:
      • Fine-tuning job cost = (number of minutes used for the fine-tuning job / 60) * hourly cost of an ml.g5.12xlarge instance
      • Hosting cost = hourly cost of an ml.g5.12xlarge instance * 24 hours * 30 days
    • For hybrid:
      • OpenSearch Serverless monthly cost = cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
      • Fine-tuning job cost = (number of minutes used for the fine-tuning job / 60) * hourly cost of an ml.g5.12xlarge instance
      • Hosting cost = hourly cost of an ml.g5.12xlarge instance * 24 hours * 30 days
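
    As a rough sanity check on the numbers in the results table, the following sketch plugs illustrative US East prices into the formulas above (approximately $0.24 per OCU-hour, $7.09 per ml.g5.12xlarge hour, and $0.22 per million input or output tokens for Meta Llama 3.1 8B; these are assumptions, so confirm current prices on the AWS pricing pages).

    # Illustrative prices -- check the current AWS pricing pages before relying on these.
    OCU_PER_HOUR = 0.24            # OpenSearch Serverless, per OCU-hour
    G5_12XLARGE_PER_HOUR = 7.09    # ml.g5.12xlarge, per hour
    PRICE_PER_M_TOKENS = 0.22      # Llama 3.1 8B on Bedrock, per million tokens (input or output)

    HOURS_PER_MONTH = 24 * 30
    requests_per_month = 1_000 * 10 * 30            # 1,000 users, 10 requests/day, 30 days
    input_tokens = requests_per_month * 2_000
    output_tokens = requests_per_month * 1_000

    opensearch = OCU_PER_HOUR * 2 * HOURS_PER_MONTH                          # ~$346
    bedrock_llm = (input_tokens + output_tokens) / 1e6 * PRICE_PER_M_TOKENS  # ~$198
    finetune_job = (15 / 60) * G5_12XLARGE_PER_HOUR                          # ~$1.77
    hosting = G5_12XLARGE_PER_HOUR * HOURS_PER_MONTH                         # ~$5,105

    print(f"RAG:         ~${opensearch + bedrock_llm:,.0f} per month")
    print(f"Fine-tuning: ~${finetune_job + hosting:,.0f} per month")
    print(f"Hybrid:      ~${opensearch + finetune_job + hosting:,.0f} per month")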

    Results

    You can find detailed evaluation results in two places in the code repository. The individual scores for each sample are in the JSON files under data/output, and a summary of the results is in summary_results.csv in the same folder.

    The results in the following table show:

    • How each approach (RAG, fine-tuning, and hybrid) performs
    • Their scores from both BERTScore and LLM evaluators
    • The cost analysis for each method calculated for the US East region
    Approach      Average BERTScore   Average LLM evaluator score   Average inference time (seconds)   Cost per month (US East region)
    RAG           0.8999              0.8200                        8.336                              ~$350 + $198 ≈ $548
    Fine-tuning   0.8660              0.5556                        4.159                              ~$1.77 + $5,105 ≈ $5,107
    Hybrid        0.8908              0.8556                        17.700                             ~$350 + $1.77 + $5,105 ≈ $5,457

    Note that the costs for both the fine-tuning and hybrid approaches can decrease significantly, depending on the traffic pattern, if you configure the SageMaker real-time inference endpoint to scale down to zero instances when not in use.

    Clean up

    Follow the cleanup section in the Readme file to avoid paying for unused resources.

    Conclusion

    In this post, we showed you how to implement and evaluate three powerful techniques for tailoring FMs to your business needs: RAG, fine-tuning, and a hybrid approach combining both methods. We provided ready-to-use code to help you experiment with these approaches and make informed decisions based on your specific use case and dataset.

    The results in this example were specific to the dataset that we used. For that dataset, RAG outperformed fine-tuning and achieved comparable results to the hybrid approach with a lower cost, but fine-tuning led to the lowest latency. Your results will vary depending on your dataset.

    We encourage you to test these approaches using our code as a starting point:

    1. Add your own datasets to the data folder.
    2. Fill out the config.py file.
    3. Follow the rest of the Readme instructions to run the full evaluation.

    About the Authors

    Idil Yuksel is a Working Student Solutions Architect at AWS, pursuing her MSc. in Informatics with a focus on machine learning at the Technical University of Munich. She is passionate about exploring application areas of machine learning and natural language processing. Outside of work and studies, she enjoys spending time in nature and practicing yoga.

    Karim Akhnoukh is a Senior Solutions Architect at AWS working with customers in the financial services and insurance industries in Germany. He is passionate about applying machine learning and generative AI to solve customers’ business challenges. Besides work, he enjoys playing sports, aimless walks, and good quality coffee.
