
    Using RouteLLM to Optimize LLM Usage

    August 10, 2025

    RouteLLM is a flexible framework for serving and evaluating LLM routers, designed to maximize performance while minimizing cost.

    Key features:

    • Seamless integration — Acts as a drop-in replacement for the OpenAI client or runs as an OpenAI-compatible server, intelligently routing simpler queries to cheaper models.
    • Pre-trained routers out of the box — Proven to cut costs by up to 85% while preserving 95% of GPT-4 performance on widely used benchmarks like MT-Bench.
    • Cost-effective excellence — Matches the performance of leading commercial offerings while being over 40% cheaper.
    • Extensible and customizable — Easily add new routers, fine-tune thresholds, and compare performance across multiple benchmarks.
    Source: https://github.com/lm-sys/RouteLLM/tree/main

    In this tutorial, we’ll walk through how to:

    • Load and use a pre-trained router.
    • Calibrate it for your own use case.
    • Test routing behavior on different types of prompts.

    Installing the dependencies

    !pip install "routellm[serve,eval]"

    Loading OpenAI API Key

    To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

    RouteLLM leverages LiteLLM to support chat completions from a wide range of both open-source and closed-source models. You can check out the list of providers at https://litellm.vercel.app/docs/providers if you want to use a different model.

    import os
    from getpass import getpass
    os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

    Downloading Config File

    RouteLLM uses a configuration file to locate pretrained router checkpoints and the datasets they were trained on. This file tells the system where to find the models that decide whether to send a query to the strong or weak model.

    Do I need to edit it?

    For most users — no. The default config already points to well-trained routers (mf, bert, causal_llm) that work out of the box. You only need to change it if you plan to:

    • Train your own router on a custom dataset.
    • Replace the routing algorithm entirely with a new one.

    For this tutorial, we’ll keep the config as is and simply:

    • Set our strong and weak model names in code.
    • Add our API keys for the chosen providers.
    • Use a calibrated threshold to balance cost and quality.
    !wget https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml

    Initializing the RouteLLM Controller

    In this code block, we import the necessary libraries and initialize the RouteLLM Controller, which manages how prompts are routed between models. We specify routers=["mf"] to use the Matrix Factorization router, a pretrained decision model that predicts whether a query should be sent to the strong or weak model.

    The strong_model parameter is set to "gpt-5", a high-quality but more expensive model, while the weak_model parameter is set to "o4-mini", a faster and cheaper alternative. For each incoming prompt, the router evaluates its complexity against a threshold and automatically chooses the most cost-effective option, ensuring that simple tasks are handled by the cheaper model while more challenging ones get the stronger model's capabilities.

    This configuration lets you balance cost efficiency and response quality without manual intervention.

    import os
    import pandas as pd
    from routellm.controller import Controller
    
    client = Controller(
        routers=["mf"],        # Matrix Factorization router
        strong_model="gpt-5",  # high-quality, more expensive
        weak_model="o4-mini"   # faster, cheaper
    )
    !python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.1 --config config.example.yaml

    This command runs RouteLLM's threshold calibration process for the Matrix Factorization (mf) router. The --strong-model-pct 0.1 argument tells the system to find the threshold value that routes roughly 10% of queries to the strong model (and the rest to the weak model).

    Using the --config config.example.yaml file for model and router settings, the calibration determined:

    For 10% strong model calls with mf, the optimal threshold is 0.24034.

    This means that any query with a router-assigned complexity score above 0.24034 will be sent to the strong model, while those below it will go to the weak model, aligning with your desired cost–quality trade-off.
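Conceptually, that decision reduces to a single comparison. A minimal sketch of the rule (the route helper and tier labels here are illustrative, not part of RouteLLM's API):

```python
THRESHOLD = 0.24034  # calibrated so roughly 10% of traffic goes to the strong model

def route(win_rate: float, threshold: float = THRESHOLD) -> str:
    """Return which model tier a query goes to, given its router-assigned score."""
    return "strong" if win_rate > threshold else "weak"

print(route(0.303))  # a complex query scoring above the threshold
print(route(0.110))  # a simple query scoring below the threshold
```

Raising the threshold sends fewer queries to the strong model (saving cost); lowering it does the opposite.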

    Defining the Threshold and Prompt Variables

    Here, we define a diverse set of test prompts designed to cover a range of complexity levels. They include simple factual questions (likely to be routed to the weak model), medium reasoning tasks (borderline threshold cases), and high-complexity or creative requests (more suited for the strong model), along with code generation tasks to test technical capabilities.

    threshold = 0.24034
    
    prompts = [
        # Easy factual (likely weak model)
        "Who wrote the novel 'Pride and Prejudice'?",
        "What is the largest planet in our solar system?",
        
        # Medium reasoning (borderline cases)
        "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",
        "Explain why the sky appears blue during the day and red/orange during sunset.",
        
        # High complexity / creative (likely strong model)
        "Write a 6-line rap verse about climate change using internal rhyme.",
        "Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",
        
        # Code generation
        "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",
        "Generate SQL to find the top 3 highest-paying customers from a 'sales' table."
    ]

    Evaluating Win Rate

    The following code calculates the win rate for each test prompt using the mf router, showing the likelihood that the strong model will outperform the weak model.

    Based on the calibrated threshold of 0.24034, two prompts —

    “If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?” (0.303087)

    “Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.” (0.272534)

    — exceed the threshold and would be routed to the strong model.

    All other prompts remain below the threshold, meaning they would be served by the weaker, cheaper model.

    win_rates = client.batch_calculate_win_rate(prompts=pd.Series(prompts), router="mf")
    
    # Store results in a DataFrame
    _df = pd.DataFrame({
        "Prompt": prompts,
        "Win_Rate": win_rates
    })
    
    # Show full prompt text without truncation, sorted by win rate
    pd.set_option('display.max_colwidth', None)
    print(_df.sort_values("Win_Rate", ascending=False))

    These results also help in fine-tuning the routing strategy — by analyzing the win rate distribution, we can adjust the threshold to better balance cost savings and performance.
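For example, you can sweep candidate thresholds over the computed win rates to see how traffic would shift. A small sketch, using hypothetical win-rate values in place of the win_rates series produced above:

```python
import pandas as pd

# Hypothetical win rates standing in for the router's scores computed above
win_rates = pd.Series([0.08, 0.12, 0.30, 0.20, 0.22, 0.18, 0.27, 0.15])

for threshold in [0.15, 0.24034, 0.30]:
    pct_strong = (win_rates > threshold).mean() * 100
    print(f"threshold={threshold:.5f} -> {pct_strong:.0f}% of prompts routed strong")
```

Picking the threshold where the strong-model share matches your budget gives you a data-driven cost cap.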

    Routing Prompts Through the Calibrated Matrix Factorization (MF) Router

    This code iterates over the list of test prompts and sends each one to the RouteLLM controller using the calibrated mf router with the specified threshold (router-mf-{threshold}).

    For each prompt, the router decides whether to use the strong or weak model based on the calculated win rate.

    The response includes both the generated output and the actual model that was selected by the router.

    These details — the prompt, model used, and generated output — are stored in the results list for later analysis.

    results = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model=f"router-mf-{threshold}",
            messages=[{"role": "user", "content": prompt}]
        )
        message = response.choices[0].message["content"]
        model_used = response.model  # RouteLLM returns the model actually used
        
        results.append({
            "Prompt": prompt,
            "Model Used": model_used,
            "Output": message
        })
    
    df = pd.DataFrame(results)

    In the results, the train-distance prompt and the palindrome-function prompt (indices 2 and 6 in the list) exceeded the threshold win rate and were therefore routed to the gpt-5 strong model, while the rest were handled by the weaker o4-mini model.
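To audit the resulting cost split, you can count how often each tier was selected. A sketch over a stand-in results list shaped like the one assembled in the loop above (prompts abbreviated):

```python
import pandas as pd

# Stand-in for the results list built in the routing loop above
results = [
    {"Prompt": "Who wrote 'Pride and Prejudice'?", "Model Used": "o4-mini"},
    {"Prompt": "Largest planet?", "Model Used": "o4-mini"},
    {"Prompt": "Train distance problem", "Model Used": "gpt-5"},
    {"Prompt": "Sky color explanation", "Model Used": "o4-mini"},
    {"Prompt": "Climate rap verse", "Model Used": "o4-mini"},
    {"Prompt": "ML paradigms summary", "Model Used": "o4-mini"},
    {"Prompt": "Palindrome function", "Model Used": "gpt-5"},
    {"Prompt": "Top-3 customers SQL", "Model Used": "o4-mini"},
]

df = pd.DataFrame(results)
print(df["Model Used"].value_counts())  # weak-model calls should dominate at this threshold
```

With the 0.24034 threshold, six of the eight prompts land on the cheap model, which is exactly the kind of split the 10% calibration target aims for.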


    The post Using RouteLLM to Optimize LLM Usage appeared first on MarkTechPost.
