
    A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain

    April 18, 2025

    Evaluating LLMs has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence across both academic and industrial settings. As the capabilities of these models expand, so too does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we provide a comprehensive examination of one of the field’s most critical frontiers: systematically evaluating the strengths and limitations of LLMs across various dimensions of performance. Using Google’s cutting-edge Generative AI models as benchmarks and the LangChain library as our orchestration tool, we present a robust and modular evaluation pipeline tailored for implementation in Google Colab. This framework integrates criterion-based scoring, encompassing correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver nuanced and actionable insights. Grounded in expert-validated question sets and objective ground truth answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.

    !pip install langchain langchain-google-genai ragas pandas matplotlib

    We install the key Python libraries for the workflow: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google’s Generative AI models), Ragas for evaluating retrieval-augmented generation pipelines (installed here as an optional add-on; the pipeline below relies on LangChain’s built-in evaluators), and pandas plus matplotlib for data manipulation and visualization.

    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.evaluation import load_evaluator
    from langchain.schema import HumanMessage

    We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain’s Google Generative AI client, prompt templating, chain construction, evaluator loader, and the HumanMessage schema used to build and assess conversational LLM pipelines.

    os.environ["GOOGLE_API_KEY"] = "Use Your API Key"

    Here, we configure the environment by storing the Google API key in the GOOGLE_API_KEY environment variable, which the LangChain Google Generative AI client uses to authenticate its requests. Replace the placeholder string with your own key before running the notebook.
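
    If you prefer not to hard-code the key in a notebook cell, a minimal sketch using Python’s getpass (an alternative to the snippet above, assuming you run the cell interactively) keeps it out of the saved notebook:

    import os
    from getpass import getpass

    # Prompt for the key at runtime so it never appears in the notebook source.
    if not os.environ.get("GOOGLE_API_KEY"):
        os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")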

    def create_evaluation_dataset():
        """Create a simple dataset for evaluation."""
        questions = [
            "Explain the concept of quantum computing in simple terms.",
            "How does a neural network learn?",
            "What are the main differences between SQL and NoSQL databases?",
            "Explain how blockchain technology works.",
            "What is the difference between supervised and unsupervised learning?"
        ]
       
        ground_truth = [
            "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
            "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
            "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
            "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
            "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
        ]
       
        return pd.DataFrame({"question": questions, "ground_truth": ground_truth})

    We construct a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground‑truth answers, making it easy to benchmark an LLM’s responses against predefined correct outputs.
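
    If you want to sanity-check the dataset or swap in your own questions, a small sketch (the CSV filename and its columns are hypothetical, not part of the tutorial) might look like this:

    # Preview the built-in questions and answers.
    dataset = create_evaluation_dataset()
    print(dataset.head())

    # Hypothetical extension: append your own question/ground_truth pairs from a CSV you maintain.
    # extra = pd.read_csv("my_eval_questions.csv")  # assumed columns: question, ground_truth
    # dataset = pd.concat([dataset, extra], ignore_index=True)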

    def setup_models():
        """Set up different Google Generative AI models for comparison."""
        models = {
            "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
            "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
        }
        return models

    Now, this function instantiates two zero‑temperature ChatGoogleGenerativeAI clients, one using the lightweight “gemini‑2.0‑flash‑lite” model and the other the full “gemini‑2.0‑flash” model, so you can easily compare their outputs side‑by‑side.
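
    The dictionary structure makes it easy to grow the comparison set; as a sketch, you could register any additional model your API key has access to (the model name below is only an illustrative assumption):

    models = setup_models()
    # Add a third contender to the comparison (substitute any model available to your key).
    models["gemini-1.5-pro"] = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)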

    def generate_responses(models, dataset):
        """Generate responses from each model for the questions in the dataset."""
        responses = {}
       
        for model_name, model in models.items():
            model_responses = []
            for question in dataset["question"]:
                try:
                    response = model.invoke([HumanMessage(content=question)])
                    model_responses.append(response.content)
                except Exception as e:
                    print(f"Error with model {model_name} on question: {question}")
                    print(f"Error: {e}")
                    model_responses.append("Error generating response")
           
            responses[model_name] = model_responses
       
        return responses

    This function loops through each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model’s name to its list of generated answers.
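
    On free-tier API quotas, rapid back-to-back calls can trigger rate-limit errors; a throttled variant of the function above (a sketch, with an arbitrary one-second default delay) slows the loop down:

    import time

    def generate_responses_throttled(models, dataset, delay_seconds=1.0):
        """Like generate_responses, but pauses between calls to stay under rate limits."""
        responses = {}
        for model_name, model in models.items():
            model_responses = []
            for question in dataset["question"]:
                try:
                    response = model.invoke([HumanMessage(content=question)])
                    model_responses.append(response.content)
                except Exception as e:
                    print(f"Error with model {model_name}: {e}")
                    model_responses.append("Error generating response")
                time.sleep(delay_seconds)
            responses[model_name] = model_responses
        return responses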

    def evaluate_responses(models, dataset, responses):
        """Evaluate model responses using different evaluation criteria."""
        evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
       
        reference_criteria = ["correctness"]
        reference_free_criteria = [
            "relevance",  
            "coherence",    
            "conciseness"  
        ]
       
        results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
                   for model_name in models.keys()}
       
        for criterion in reference_criteria:
            evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
           
            for model_name in models.keys():
                for i, question in enumerate(dataset["question"]):
                    ground_truth = dataset["ground_truth"][i]
                    response = responses[model_name][i]
                   
                    if response != "Error generating response":
                        eval_result = evaluator.evaluate_strings(
                            prediction=response,
                            reference=ground_truth,
                            input=question
                        )
                        normalized_score = float(eval_result.get('score', 0)) * 2
                        results[model_name][criterion].append(normalized_score)
                    else:
                        results[model_name][criterion].append(0)  
       
        for criterion in reference_free_criteria:
            evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
           
            for model_name in models.keys():
                for i, question in enumerate(dataset["question"]):
                    response = responses[model_name][i]
                   
                    if response != "Error generating response":
                        eval_result = evaluator.evaluate_strings(
                            prediction=response,
                            input=question
                        )
                        normalized_score = float(eval_result.get('score', 0)) * 2
                        results[model_name][criterion].append(normalized_score)
                    else:
                        results[model_name][criterion].append(0)  
        return results

    This function leverages a “gemini‑2.0‑flash‑lite” evaluator to score each model’s answers on both reference‑based correctness and reference‑free metrics (relevance, coherence, conciseness), normalizes those scores, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
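
    To see what the evaluator actually returns before aggregating, you can run a single criterion check by hand. This is a sketch; the exact keys in the result dictionary can vary by LangChain version, though 'score', 'value', and 'reasoning' are typical for criteria evaluators:

    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluator_model)

    sample = evaluator.evaluate_strings(
        prediction="Qubits can hold superpositions, letting quantum computers explore many states at once.",
        input="Explain the concept of quantum computing in simple terms."
    )
    print(sample)  # typically something like {'reasoning': '...', 'value': 'Y', 'score': 1}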

    def calculate_average_scores(evaluation_results):
        """Calculate average scores for each model and criterion."""
        avg_scores = {}
       
        for model_name, criteria in evaluation_results.items():
            avg_scores[model_name] = {}
           
            for criterion, scores in criteria.items():
                if scores:
                    avg_scores[model_name][criterion] = sum(scores) / len(scores)
                else:
                    avg_scores[model_name][criterion] = 0
                   
            all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
            if all_scores:
                avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
            else:
                avg_scores[model_name]["overall"] = 0
               
        return avg_scores

    This function processes the nested evaluation results to compute the mean score for each criterion across all questions for every model. Also, it calculates an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per‑criterion averages and an aggregated “overall” performance score.
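
    Because the returned avg_scores object is a plain nested dictionary, a one-liner turns it into a readable table (a convenience sketch, not part of the pipeline):

    avg_df = pd.DataFrame(avg_scores).T  # rows = models, columns = criteria plus "overall"
    print(avg_df.round(2))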

    def visualize_results(avg_scores):
        """Visualize evaluation results with bar charts."""
        models = list(avg_scores.keys())
        criteria = list(avg_scores[models[0]].keys())
       
        plt.figure(figsize=(14, 8))
       
        bar_width = 0.8 / len(models)
       
        positions = range(len(criteria))
       
        for i, model in enumerate(models):
            model_scores = [avg_scores[model][criterion] for criterion in criteria]
            plt.bar([p + i * bar_width for p in positions], model_scores,
                    width=bar_width, label=model)
       
        plt.xlabel('Evaluation Criteria', fontsize=12)
        plt.ylabel('Average Score (0-10)', fontsize=12)
        plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
        plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
        plt.legend()
        plt.grid(axis='y', linestyle='--', alpha=0.7)
       
        plt.tight_layout()
        plt.show()
       
        plt.figure(figsize=(10, 8))
       
        categories = [c for c in criteria if c != 'overall']
        N = len(categories)
       
        angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
        angles += angles[:1]  
       
        plt.polar(angles, [0] * (N + 1))
        plt.xticks(angles[:-1], categories)
       
        for model in models:
            values = [avg_scores[model][c] for c in categories]
            values += values[:1]  
            plt.polar(angles, values, label=model)
       
        plt.legend(loc='upper right')
        plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
        plt.tight_layout()
        plt.show()
    

    This function first draws a grouped bar chart comparing each model’s average scores across all evaluation criteria, then renders a radar chart of their performance profiles, enabling quick identification of relative strengths and weaknesses.
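
    The charts are only displayed, not written to disk; if you want a file you can attach to a report, a small self-contained sketch (the filename is an arbitrary choice) saves an overall-score bar chart:

    def save_overall_chart(avg_scores, path="llm_overall_scores.png"):
        """Save a single bar chart of each model's overall average score."""
        model_names = list(avg_scores.keys())
        overall = [avg_scores[m].get("overall", 0) for m in model_names]
        plt.figure(figsize=(6, 4))
        plt.bar(model_names, overall)
        plt.ylabel("Average overall score")
        plt.title("Overall LLM comparison")
        plt.tight_layout()
        plt.savefig(path, dpi=150)
        plt.close()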

    def main():
        print("Creating evaluation dataset...")
        dataset = create_evaluation_dataset()
       
        print("Setting up models...")
        models = setup_models()
       
        print("Generating responses...")
        responses = generate_responses(models, dataset)
       
        print("Evaluating responses...")
        evaluation_results = evaluate_responses(models, dataset, responses)
       
        print("Calculating average scores...")
        avg_scores = calculate_average_scores(evaluation_results)
       
        print("Average scores:")
        for model, scores in avg_scores.items():
            print(f"n{model}:")
            for criterion, score in scores.items():
                print(f"  {criterion}: {score:.2f}")
       
        print("nVisualizing results...")
        visualize_results(avg_scores)
       
        print("Saving results to CSV...")
        results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
        for model, criteria in avg_scores.items():
            for criterion, score in criteria.items():
                results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                      ignore_index=True)
       
        results_df.to_csv("llm_evaluation_results.csv", index=False)
        print("Results saved to llm_evaluation_results.csv")
       
        detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
       
        for i, question in enumerate(dataset["question"]):
            row = {
                "Question": question,
                "Ground Truth": dataset["ground_truth"][i]
            }
           
            for model_name in models.keys():
                row[model_name] = responses[model_name][i]
           
            detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
       
        detailed_df.to_csv("llm_response_comparison.csv", index=False)
        print("Detailed responses saved to llm_response_comparison.csv")

    The main function orchestrates the entire evaluation workflow end‑to‑end: it builds the dataset, initializes models, generates and scores responses, computes and displays average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.

    def pairwise_model_comparison(models, dataset, responses):
        """Compare two models side by side using an LLM as judge."""
        evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
       
        pairwise_template = """
        Question: {question}
       
        Response A: {response_a}
       
        Response B: {response_b}
       
        Which response better answers the user's question? Consider factors like accuracy,
        helpfulness, clarity, and completeness.
       
        First, analyze each response point by point. Then conclude with your choice of either:
        A is better, B is better, or They are equally good/bad.
       
        Your analysis:
        """
       
        pairwise_prompt = PromptTemplate(
            input_variables=["question", "response_a", "response_b"],
            template=pairwise_template
        )
       
        pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
       
        model_names = list(models.keys())
       
        pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
       
        for i, question in enumerate(dataset["question"]):
            for j, model_a in enumerate(model_names):
                for model_b in model_names[j+1:]:  
                    response_a = responses[model_a][i]
                    response_b = responses[model_b][i]
                   
                    if response_a != "Error generating response" and response_b != "Error generating response":
                        comparison_result = pairwise_chain.run(
                            question=question,
                            response_a=response_a,
                            response_b=response_b
                        )
                       
                        key_ab = f"{model_a} vs {model_b}"
                        pairwise_results[key_ab].append({
                            "question": question,
                            "result": comparison_result
                        })
       
        return pairwise_results

    This function runs head-to-head comparisons for each unique model pair by prompting a “gemini-2.0-flash-lite” judge to analyze and rank their responses on accuracy, clarity, and completeness, collecting per-question verdicts into a structured dictionary for side-by-side evaluation.
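
    The judge’s verdicts come back as free text, so a rough tally helper can summarize them. This is a sketch that keys off the exact phrases the prompt requests, which an LLM judge will not always reproduce verbatim:

    def tally_pairwise_verdicts(pairwise_results):
        """Count how often the judge preferred A, preferred B, or gave no clear verdict."""
        tallies = {}
        for comparison, results in pairwise_results.items():
            counts = {"A is better": 0, "B is better": 0, "equal/unclear": 0}
            for result in results:
                verdict = result["result"].lower()
                if "a is better" in verdict:
                    counts["A is better"] += 1
                elif "b is better" in verdict:
                    counts["B is better"] += 1
                else:
                    counts["equal/unclear"] += 1
            tallies[comparison] = counts
        return tallies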

    def enhanced_main():
        """Enhanced main function with additional evaluations."""
        print("Creating evaluation dataset...")
        dataset = create_evaluation_dataset()
       
        print("Setting up models...")
        models = setup_models()
       
        print("Generating responses...")
        responses = generate_responses(models, dataset)
       
        print("Evaluating responses...")
        evaluation_results = evaluate_responses(models, dataset, responses)
       
        print("Calculating average scores...")
        avg_scores = calculate_average_scores(evaluation_results)
       
        print("Average scores:")
        for model, scores in avg_scores.items():
            print(f"n{model}:")
            for criterion, score in scores.items():
                print(f"  {criterion}: {score:.2f}")
       
        print("nVisualizing results...")
        visualize_results(avg_scores)
       
        print("nPerforming pairwise model comparison...")
        pairwise_results = pairwise_model_comparison(models, dataset, responses)
       
        print("nPairwise comparison results:")
        for comparison, results in pairwise_results.items():
            print(f"n{comparison}:")
            for i, result in enumerate(results[:2]):
                print(f"  Question {i+1}: {result['question']}")
                print(f"  Analysis: {result['result'][:100]}...")
       
        print("nSaving all results...")
        results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
        for model, criteria in avg_scores.items():
            for criterion, score in criteria.items():
                results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                      ignore_index=True)
       
        results_df.to_csv("llm_evaluation_results.csv", index=False)
       
        detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
       
        for i, question in enumerate(dataset["question"]):
            row = {
                "Question": question,
                "Ground Truth": dataset["ground_truth"][i]
            }
           
            for model_name in models.keys():
                row[model_name] = responses[model_name][i]
           
            detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
       
        detailed_df.to_csv("llm_response_comparison.csv", index=False)
       
        pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
       
        for comparison, results in pairwise_results.items():
            for result in results:
                pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                    "Comparison": comparison,
                    "Question": result["question"],
                    "Analysis": result["result"]
                }])], ignore_index=True)
       
        pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
       
        print("All results saved to CSV files.")

    The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analyses), so you end up with a complete, side-by-side evaluation workspace.
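
    Once those CSVs exist, you can reload them in a fresh session to inspect or re-plot the results without repeating the API calls; a quick sketch:

    summary = pd.read_csv("llm_evaluation_results.csv")
    detailed = pd.read_csv("llm_response_comparison.csv")
    pairwise = pd.read_csv("llm_pairwise_comparison.csv")

    # Pivot the summary back into a model-by-criterion table.
    print(summary.pivot(index="Model", columns="Criterion", values="Score").round(2))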

    if __name__ == "__main__":
        enhanced_main()
    

    Finally, this guard ensures that when the script is executed directly (not imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end‑to‑end.
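
    If you want to skip the pairwise judging, which adds an extra judge call for every question and model pair, you can point the guard at the basic pipeline instead (a minor variation, not part of the original script):

    if __name__ == "__main__":
        main()  # criterion-based evaluation and CSV export only, no pairwise judging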

    In conclusion, this tutorial has introduced a versatile and principled framework for evaluating and comparing the performance of LLMs, leveraging Google’s Generative AI capabilities alongside the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, our evaluation pipeline enables practitioners to identify subtle yet significant performance differences that directly impact downstream applications. The outputs, including CSV-based reporting, radar plots, and bar graphs, not only support transparent benchmarking but also guide data-driven decision-making in model selection and deployment.


