
    A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain

    April 18, 2025

    Evaluating LLMs has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence across both academic and industrial settings. As the capabilities of these models expand, so too does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we provide a comprehensive examination of one of the field’s most critical frontiers: systematically evaluating the strengths and limitations of LLMs across various dimensions of performance. Using Google’s cutting-edge Generative AI models as benchmarks and the LangChain library as our orchestration tool, we present a robust and modular evaluation pipeline tailored for implementation in Google Colab. This framework integrates criterion-based scoring, encompassing correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver nuanced and actionable insights. Grounded in expert-validated question sets and objective ground truth answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.

    !pip install langchain langchain-google-genai ragas pandas matplotlib

    We install the key Python libraries for the workflow: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google’s Generative AI models), Ragas for evaluating retrieval-augmented generation pipelines (installed here as an optional add-on; the pipeline below relies on LangChain’s built-in evaluators), and pandas plus matplotlib for data manipulation and visualization.

    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.evaluation import load_evaluator
    from langchain.schema import HumanMessage

    We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain’s Google Generative AI client, prompt templating, chain construction, evaluator loader, and the HumanMessage schema used to build and assess conversational LLM pipelines.

    os.environ["GOOGLE_API_KEY"] = "Use Your API Key"

    Here, we configure the environment by storing the Google API key in the GOOGLE_API_KEY environment variable, which the LangChain Google Generative AI client uses to authenticate its requests. Replace the placeholder string with your own key before running the notebook.
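
    If you prefer not to hard-code the key in a notebook cell, a minimal sketch using Python’s getpass (an alternative to the snippet above, assuming you run the cell interactively) keeps it out of the saved notebook:

    import os
    from getpass import getpass

    # Prompt for the key at runtime so it never appears in the notebook source.
    if not os.environ.get("GOOGLE_API_KEY"):
        os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")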

    def create_evaluation_dataset():
        """Create a simple dataset for evaluation."""
        questions = [
            "Explain the concept of quantum computing in simple terms.",
            "How does a neural network learn?",
            "What are the main differences between SQL and NoSQL databases?",
            "Explain how blockchain technology works.",
            "What is the difference between supervised and unsupervised learning?"
        ]
       
        ground_truth = [
            "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
            "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
            "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
            "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
            "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
        ]
       
        return pd.DataFrame({"question": questions, "ground_truth": ground_truth})

    We construct a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground‑truth answers, making it easy to benchmark an LLM’s responses against predefined correct outputs.
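
    If you want to sanity-check the dataset or swap in your own questions, a small sketch (the CSV filename and its columns are hypothetical, not part of the tutorial) might look like this:

    # Preview the built-in questions and answers.
    dataset = create_evaluation_dataset()
    print(dataset.head())

    # Hypothetical extension: append your own question/ground_truth pairs from a CSV you maintain.
    # extra = pd.read_csv("my_eval_questions.csv")  # assumed columns: question, ground_truth
    # dataset = pd.concat([dataset, extra], ignore_index=True)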

    def setup_models():
        """Set up different Google Generative AI models for comparison."""
        models = {
            "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
            "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
        }
        return models

    Now, this function instantiates two zero‑temperature ChatGoogleGenerativeAI clients, one using the lightweight “gemini‑2.0‑flash‑lite” model and the other the full “gemini‑2.0‑flash” model, so you can easily compare their outputs side‑by‑side.
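
    The dictionary structure makes it easy to grow the comparison set; as a sketch, you could register any additional model your API key has access to (the model name below is only an illustrative assumption):

    models = setup_models()
    # Add a third contender to the comparison (substitute any model available to your key).
    models["gemini-1.5-pro"] = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)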

    def generate_responses(models, dataset):
        """Generate responses from each model for the questions in the dataset."""
        responses = {}
       
        for model_name, model in models.items():
            model_responses = []
            for question in dataset["question"]:
                try:
                    response = model.invoke([HumanMessage(content=question)])
                    model_responses.append(response.content)
                except Exception as e:
                    print(f"Error with model {model_name} on question: {question}")
                    print(f"Error: {e}")
                    model_responses.append("Error generating response")
           
            responses[model_name] = model_responses
       
        return responses

    This function loops through each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model’s name to its list of generated answers.
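
    On free-tier API quotas, rapid back-to-back calls can trigger rate-limit errors; a throttled variant of the function above (a sketch, with an arbitrary one-second default delay) slows the loop down:

    import time

    def generate_responses_throttled(models, dataset, delay_seconds=1.0):
        """Like generate_responses, but pauses between calls to stay under rate limits."""
        responses = {}
        for model_name, model in models.items():
            model_responses = []
            for question in dataset["question"]:
                try:
                    response = model.invoke([HumanMessage(content=question)])
                    model_responses.append(response.content)
                except Exception as e:
                    print(f"Error with model {model_name}: {e}")
                    model_responses.append("Error generating response")
                time.sleep(delay_seconds)
            responses[model_name] = model_responses
        return responses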

    def evaluate_responses(models, dataset, responses):
        """Evaluate model responses using different evaluation criteria."""
        evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
       
        reference_criteria = ["correctness"]
        reference_free_criteria = [
            "relevance",  
            "coherence",    
            "conciseness"  
        ]
       
        results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
                   for model_name in models.keys()}
       
        for criterion in reference_criteria:
            evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
           
            for model_name in models.keys():
                for i, question in enumerate(dataset["question"]):
                    ground_truth = dataset["ground_truth"][i]
                    response = responses[model_name][i]
                   
                    if response != "Error generating response":
                        eval_result = evaluator.evaluate_strings(
                            prediction=response,
                            reference=ground_truth,
                            input=question
                        )
                        normalized_score = float(eval_result.get('score', 0)) * 2
                        results[model_name][criterion].append(normalized_score)
                    else:
                        results[model_name][criterion].append(0)  
       
        for criterion in reference_free_criteria:
            evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
           
            for model_name in models.keys():
                for i, question in enumerate(dataset["question"]):
                    response = responses[model_name][i]
                   
                    if response != "Error generating response":
                        eval_result = evaluator.evaluate_strings(
                            prediction=response,
                            input=question
                        )
                        normalized_score = float(eval_result.get('score', 0)) * 2
                        results[model_name][criterion].append(normalized_score)
                    else:
                        results[model_name][criterion].append(0)  
        return results

    This function leverages a “gemini‑2.0‑flash‑lite” evaluator to score each model’s answers on both reference‑based correctness and reference‑free metrics (relevance, coherence, conciseness), normalizes those scores, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
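
    To see what the evaluator actually returns before aggregating, you can run a single criterion check by hand. This is a sketch; the exact keys in the result dictionary can vary by LangChain version, though 'score', 'value', and 'reasoning' are typical for criteria evaluators:

    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluator_model)

    sample = evaluator.evaluate_strings(
        prediction="Qubits can hold superpositions, letting quantum computers explore many states at once.",
        input="Explain the concept of quantum computing in simple terms."
    )
    print(sample)  # typically something like {'reasoning': '...', 'value': 'Y', 'score': 1}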

    def calculate_average_scores(evaluation_results):
        """Calculate average scores for each model and criterion."""
        avg_scores = {}
       
        for model_name, criteria in evaluation_results.items():
            avg_scores[model_name] = {}
           
            for criterion, scores in criteria.items():
                if scores:
                    avg_scores[model_name][criterion] = sum(scores) / len(scores)
                else:
                    avg_scores[model_name][criterion] = 0
                   
            all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
            if all_scores:
                avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
            else:
                avg_scores[model_name]["overall"] = 0
               
        return avg_scores

    This function processes the nested evaluation results to compute the mean score for each criterion across all questions for every model. Also, it calculates an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per‑criterion averages and an aggregated “overall” performance score.
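
    Because the returned avg_scores object is a plain nested dictionary, a one-liner turns it into a readable table (a convenience sketch, not part of the pipeline):

    avg_df = pd.DataFrame(avg_scores).T  # rows = models, columns = criteria plus "overall"
    print(avg_df.round(2))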

    def visualize_results(avg_scores):
        """Visualize evaluation results with bar charts."""
        models = list(avg_scores.keys())
        criteria = list(avg_scores[models[0]].keys())
       
        plt.figure(figsize=(14, 8))
       
        bar_width = 0.8 / len(models)
       
        positions = range(len(criteria))
       
        for i, model in enumerate(models):
            model_scores = [avg_scores[model][criterion] for criterion in criteria]
            plt.bar([p + i * bar_width for p in positions], model_scores,
                    width=bar_width, label=model)
       
        plt.xlabel('Evaluation Criteria', fontsize=12)
        plt.ylabel('Average Score (0-10)', fontsize=12)
        plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
        plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
        plt.legend()
        plt.grid(axis='y', linestyle='--', alpha=0.7)
       
        plt.tight_layout()
        plt.show()
       
        plt.figure(figsize=(10, 8))
       
        categories = [c for c in criteria if c != 'overall']
        N = len(categories)
       
        angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
        angles += angles[:1]  
       
        plt.polar(angles, [0] * (N + 1))
        plt.xticks(angles[:-1], categories)
       
        for model in models:
            values = [avg_scores[model][c] for c in categories]
            values += values[:1]  
            plt.polar(angles, values, label=model)
       
        plt.legend(loc='upper right')
        plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
        plt.tight_layout()
        plt.show()
    

    This function first draws a grouped bar chart comparing each model’s average scores across all evaluation criteria, then renders a radar chart of their performance profiles, enabling quick identification of relative strengths and weaknesses.
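
    The charts are only displayed, not written to disk; if you want a file you can attach to a report, a small self-contained sketch (the filename is an arbitrary choice) saves an overall-score bar chart:

    def save_overall_chart(avg_scores, path="llm_overall_scores.png"):
        """Save a single bar chart of each model's overall average score."""
        model_names = list(avg_scores.keys())
        overall = [avg_scores[m].get("overall", 0) for m in model_names]
        plt.figure(figsize=(6, 4))
        plt.bar(model_names, overall)
        plt.ylabel("Average overall score")
        plt.title("Overall LLM comparison")
        plt.tight_layout()
        plt.savefig(path, dpi=150)
        plt.close()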

    def main():
        print("Creating evaluation dataset...")
        dataset = create_evaluation_dataset()
       
        print("Setting up models...")
        models = setup_models()
       
        print("Generating responses...")
        responses = generate_responses(models, dataset)
       
        print("Evaluating responses...")
        evaluation_results = evaluate_responses(models, dataset, responses)
       
        print("Calculating average scores...")
        avg_scores = calculate_average_scores(evaluation_results)
       
        print("Average scores:")
        for model, scores in avg_scores.items():
            print(f"n{model}:")
            for criterion, score in scores.items():
                print(f"  {criterion}: {score:.2f}")
       
        print("nVisualizing results...")
        visualize_results(avg_scores)
       
        print("Saving results to CSV...")
        results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
        for model, criteria in avg_scores.items():
            for criterion, score in criteria.items():
                results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                      ignore_index=True)
       
        results_df.to_csv("llm_evaluation_results.csv", index=False)
        print("Results saved to llm_evaluation_results.csv")
       
        detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
       
        for i, question in enumerate(dataset["question"]):
            row = {
                "Question": question,
                "Ground Truth": dataset["ground_truth"][i]
            }
           
            for model_name in models.keys():
                row[model_name] = responses[model_name][i]
           
            detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
       
        detailed_df.to_csv("llm_response_comparison.csv", index=False)
        print("Detailed responses saved to llm_response_comparison.csv")

    The main function orchestrates the entire evaluation workflow end‑to‑end: it builds the dataset, initializes models, generates and scores responses, computes and displays average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.

    def pairwise_model_comparison(models, dataset, responses):
        """Compare two models side by side using an LLM as judge."""
        evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
       
        pairwise_template = """
        Question: {question}
       
        Response A: {response_a}
       
        Response B: {response_b}
       
        Which response better answers the user's question? Consider factors like accuracy,
        helpfulness, clarity, and completeness.
       
        First, analyze each response point by point. Then conclude with your choice of either:
        A is better, B is better, or They are equally good/bad.
       
        Your analysis:
        """
       
        pairwise_prompt = PromptTemplate(
            input_variables=["question", "response_a", "response_b"],
            template=pairwise_template
        )
       
        pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
       
        model_names = list(models.keys())
       
        pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
       
        for i, question in enumerate(dataset["question"]):
            for j, model_a in enumerate(model_names):
                for model_b in model_names[j+1:]:  
                    response_a = responses[model_a][i]
                    response_b = responses[model_b][i]
                   
                    if response_a != "Error generating response" and response_b != "Error generating response":
                        comparison_result = pairwise_chain.run(
                            question=question,
                            response_a=response_a,
                            response_b=response_b
                        )
                       
                        key_ab = f"{model_a} vs {model_b}"
                        pairwise_results[key_ab].append({
                            "question": question,
                            "result": comparison_result
                        })
       
        return pairwise_results

    This function runs head-to-head comparisons for each unique model pair by prompting a “gemini-2.0-flash-lite” judge to analyze and rank their responses on accuracy, clarity, and completeness, collecting per-question verdicts into a structured dictionary for side-by-side evaluation.
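
    The judge’s verdicts come back as free text, so a rough tally helper can summarize them. This is a sketch that keys off the exact phrases the prompt requests, which an LLM judge will not always reproduce verbatim:

    def tally_pairwise_verdicts(pairwise_results):
        """Count how often the judge preferred A, preferred B, or gave no clear verdict."""
        tallies = {}
        for comparison, results in pairwise_results.items():
            counts = {"A is better": 0, "B is better": 0, "equal/unclear": 0}
            for result in results:
                verdict = result["result"].lower()
                if "a is better" in verdict:
                    counts["A is better"] += 1
                elif "b is better" in verdict:
                    counts["B is better"] += 1
                else:
                    counts["equal/unclear"] += 1
            tallies[comparison] = counts
        return tallies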

    def enhanced_main():
        """Enhanced main function with additional evaluations."""
        print("Creating evaluation dataset...")
        dataset = create_evaluation_dataset()
       
        print("Setting up models...")
        models = setup_models()
       
        print("Generating responses...")
        responses = generate_responses(models, dataset)
       
        print("Evaluating responses...")
        evaluation_results = evaluate_responses(models, dataset, responses)
       
        print("Calculating average scores...")
        avg_scores = calculate_average_scores(evaluation_results)
       
        print("Average scores:")
        for model, scores in avg_scores.items():
            print(f"n{model}:")
            for criterion, score in scores.items():
                print(f"  {criterion}: {score:.2f}")
       
        print("nVisualizing results...")
        visualize_results(avg_scores)
       
        print("nPerforming pairwise model comparison...")
        pairwise_results = pairwise_model_comparison(models, dataset, responses)
       
        print("nPairwise comparison results:")
        for comparison, results in pairwise_results.items():
            print(f"n{comparison}:")
            for i, result in enumerate(results[:2]):
                print(f"  Question {i+1}: {result['question']}")
                print(f"  Analysis: {result['result'][:100]}...")
       
        print("nSaving all results...")
        results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
        for model, criteria in avg_scores.items():
            for criterion, score in criteria.items():
                results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                      ignore_index=True)
       
        results_df.to_csv("llm_evaluation_results.csv", index=False)
       
        detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
       
        for i, question in enumerate(dataset["question"]):
            row = {
                "Question": question,
                "Ground Truth": dataset["ground_truth"][i]
            }
           
            for model_name in models.keys():
                row[model_name] = responses[model_name][i]
           
            detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
       
        detailed_df.to_csv("llm_response_comparison.csv", index=False)
       
        pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
       
        for comparison, results in pairwise_results.items():
            for result in results:
                pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                    "Comparison": comparison,
                    "Question": result["question"],
                    "Analysis": result["result"]
                }])], ignore_index=True)
       
        pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
       
        print("All results saved to CSV files.")

    The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analyses), so you end up with a complete, side-by-side evaluation workspace.
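
    Once those CSVs exist, you can reload them in a fresh session to inspect or re-plot the results without repeating the API calls; a quick sketch:

    summary = pd.read_csv("llm_evaluation_results.csv")
    detailed = pd.read_csv("llm_response_comparison.csv")
    pairwise = pd.read_csv("llm_pairwise_comparison.csv")

    # Pivot the summary back into a model-by-criterion table.
    print(summary.pivot(index="Model", columns="Criterion", values="Score").round(2))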

    if __name__ == "__main__":
        enhanced_main()
    

    Finally, this guard ensures that when the script is executed directly (not imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end‑to‑end.
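
    If you want to skip the pairwise judging, which adds an extra judge call for every question and model pair, you can point the guard at the basic pipeline instead (a minor variation, not part of the original script):

    if __name__ == "__main__":
        main()  # criterion-based evaluation and CSV export only, no pairwise judging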

    In conclusion, this tutorial has introduced a versatile and principled framework for evaluating and comparing the performance of LLMs, leveraging Google’s Generative AI capabilities alongside the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, our evaluation pipeline enables practitioners to identify subtle yet significant performance differences that directly impact downstream applications. The outputs, including CSV-based reporting, radar plots, and bar graphs, not only support transparent benchmarking but also guide data-driven decision-making in model selection and deployment.


