    OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers

    April 9, 2025

    In a significant move to empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new toolset that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

    Why the Evals API Matters

    Evaluating LLM performance has often been a manual, time-consuming process, especially for teams scaling applications across diverse domains. With the Evals API, OpenAI provides a systematic approach to:

    • Assess model performance on custom test cases
    • Measure improvements across prompt iterations
    • Automate quality assurance in development pipelines

    Now, every developer can treat evaluation as a first-class citizen in the development cycle—similar to how unit tests are treated in traditional software engineering.

    Core Features of the Evals API

    1. Custom Eval Definitions: Developers can write their own evaluation logic by extending base classes.
    2. Test Data Integration: Seamlessly integrate evaluation datasets to test specific scenarios.
    3. Parameter Configuration: Configure model, temperature, max tokens, and other generation parameters.
    4. Automated Runs: Trigger evaluations via code, and retrieve results programmatically.

    The Evals API supports a YAML-based configuration structure, allowing for both flexibility and reusability.
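
    As a purely illustrative sketch (the field names below are assumptions, not the documented schema), an eval_config.yaml along these lines might look like:

    # Illustrative only; consult the official Evals documentation for the real schema.
    eval_name: my_eval
    model: gpt-4
    temperature: 0.0
    max_tokens: 256
    dataset: data/examples.jsonl   # test cases with input/ideal pairs
    metrics:
      - accuracy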

    Getting Started with the Evals API

    To use the Evals API, you first install the OpenAI Python package:

    pip install openai

    Then you can run an evaluation using a built-in eval, such as factuality_qna:

    oai evals registry:evaluation:factuality_qna \
      --completion_fns gpt-4 \
      --record_path eval_results.jsonl

    Or define a custom eval in Python:

    import openai.evals

    class MyRegressionEval(openai.evals.Eval):
        def run(self):
            # Iterate over the eval's dataset of {"input": ..., "ideal": ...} examples.
            for example in self.get_examples():
                # Ask the configured completion function (the model under test) for an answer.
                result = self.completion_fn(example['input'])
                # Score the answer against the ideal reference for this example.
                score = self.compute_score(result, example['ideal'])
                # Emit one result record per example for aggregation and reporting.
                yield self.make_result(result=result, score=score)

    This example shows how you can define custom evaluation logic, in this case scoring each model response against its ideal answer for a regression-style task.
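
    For a concrete sense of what self.completion_fn might wrap, here is a minimal sketch using the standard OpenAI Python client; how the function gets registered with the eval is not shown here and would depend on the eval's configuration.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chat_completion_fn(prompt: str) -> str:
        # A plain chat-completion call that could stand in as the eval's completion function.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content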

    Use Case: Regression Evaluation

    OpenAI’s cookbook example walks through building a regression evaluator using the API. Here’s a simplified version:

    import openai.evals
    from sklearn.metrics import mean_squared_error

    class RegressionEval(openai.evals.Eval):
        def run(self):
            predictions, labels = [], []
            for example in self.get_examples():
                # The model is expected to reply with a bare number for each input.
                response = self.completion_fn(example['input'])
                predictions.append(float(response.strip()))
                labels.append(example['ideal'])
            # Aggregate the error over the whole dataset; negate it so a higher score is better.
            mse = mean_squared_error(labels, predictions)
            yield self.make_result(result={"mse": mse}, score=-mse)

    This allows developers to benchmark numerical predictions from models and track changes over time.
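
    To make "track changes over time" concrete, one lightweight pattern (separate from the Evals API itself) is to append each run's MSE to a local history file and compare new runs against the previous one:

    import datetime
    import json

    def log_run(mse: float, prompt_version: str, path: str = "eval_history.jsonl") -> None:
        # Append one record per eval run so regressions across prompt versions stay visible.
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "prompt_version": prompt_version,
            "mse": mse,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def previous_mse(path: str = "eval_history.jsonl") -> float | None:
        # Return the MSE of the most recent logged run, or None if there is no history yet.
        try:
            with open(path) as f:
                lines = [line for line in f if line.strip()]
        except FileNotFoundError:
            return None
        return json.loads(lines[-1])["mse"] if lines else None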

    Seamless Workflow Integration

    Whether you’re building a chatbot, summarization engine, or classification system, evaluations can now be triggered as part of your CI/CD pipeline. This ensures that every prompt or model update maintains or improves performance before going live.

    openai.evals.run(
      eval_name="my_eval",
      completion_fn="gpt-4",
      eval_config={"path": "eval_config.yaml"}
    )
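
    As a hedged sketch of how such a run could gate a deployment, a pipeline step might compare the new aggregate score against a stored baseline; the return shape of openai.evals.run and the "score" field assumed below are illustrative, not confirmed API.

    # Illustrative CI gate: the summary shape returned by openai.evals.run is an assumption.
    import openai.evals

    BASELINE_SCORE = 0.85  # aggregate score from the last known-good run

    summary = openai.evals.run(
        eval_name="my_eval",
        completion_fn="gpt-4",
        eval_config={"path": "eval_config.yaml"},
    )

    if summary["score"] < BASELINE_SCORE:
        raise SystemExit(f"Eval regression: {summary['score']:.3f} < {BASELINE_SCORE:.3f}")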

    Conclusion

    The launch of the Evals API marks a shift toward robust, automated evaluation standards in LLM development. By offering the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continuously improve the quality of their AI applications.

    To explore further, check out the official OpenAI Evals documentation and the cookbook examples.

    The post OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers appeared first on MarkTechPost.

