    OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers

    April 9, 2025

    In a significant move to empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new toolset that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

    Why the Evals API Matters

    Evaluating LLM performance has often been a manual, time-consuming process, especially for teams scaling applications across diverse domains. With the Evals API, OpenAI provides a systematic approach to:

    • Assess model performance on custom test cases
    • Measure improvements across prompt iterations
    • Automate quality assurance in development pipelines

    Developers can now treat evaluation as a first-class part of the development cycle, much as unit tests are treated in traditional software engineering.

    Core Features of the Evals API

    1. Custom Eval Definitions: Developers can write their own evaluation logic by extending base classes.
    2. Test Data Integration: Seamlessly integrate evaluation datasets to test specific scenarios.
    3. Parameter Configuration: Configure model, temperature, max tokens, and other generation parameters.
    4. Automated Runs: Trigger evaluations via code, and retrieve results programmatically.

    The Evals API supports a YAML-based configuration structure, allowing for both flexibility and reusability.
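
    For illustration, a hypothetical eval_config.yaml might group the dataset, model, and generation parameters described above. The file name matches the later examples, but the field names and layout shown here are assumptions made for illustration, not the documented schema:

    # eval_config.yaml (illustrative structure, not the official schema)
    eval_name: my_eval
    dataset:
      path: data/examples.jsonl    # test cases with input and ideal fields
    completion:
      model: gpt-4
      temperature: 0.0
      max_tokens: 256
    scoring:
      metric: accuracy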

    Getting Started with the Evals API

    To use the Evals API, you first install the OpenAI Python package:

    pip install openai

    Then, you can run an evaluation using a built-in eval, such as factuality_qna:

    oai evals registry:evaluation:factuality_qna \
      --completion_fns gpt-4 \
      --record_path eval_results.jsonl

    Or define a custom eval in Python:

    import openai.evals

    class MyRegressionEval(openai.evals.Eval):
        def run(self):
            # Iterate over the examples in the evaluation dataset.
            for example in self.get_examples():
                # Query the configured completion function (the model under test).
                result = self.completion_fn(example['input'])
                # Score the model's output against the reference ("ideal") answer.
                score = self.compute_score(result, example['ideal'])
                # Emit one result record per example.
                yield self.make_result(result=result, score=score)

    This example shows how you can define custom evaluation logic, in this case for measuring regression accuracy.
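
    The snippet above leaves compute_score undefined, since scoring is up to the eval author. A minimal sketch of such a method for MyRegressionEval, assuming the model returns a numeric string and using negated absolute error as the score (the helper body is illustrative, not part of the Evals API):

        def compute_score(self, result, ideal):
            # Hypothetical scorer: parse the model's output as a number and
            # reward predictions that are close to the reference value.
            try:
                prediction = float(result.strip())
            except ValueError:
                return float("-inf")  # unparseable output gets the worst score
            return -abs(prediction - float(ideal))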

    Use Case: Regression Evaluation

    OpenAI’s cookbook example walks through building a regression evaluator using the API. Here’s a simplified version:

    import openai.evals
    from sklearn.metrics import mean_squared_error

    class RegressionEval(openai.evals.Eval):
        def run(self):
            predictions, labels = [], []
            for example in self.get_examples():
                # Ask the model for a numeric prediction and parse its text output.
                response = self.completion_fn(example['input'])
                predictions.append(float(response.strip()))
                labels.append(example['ideal'])
            # Report mean squared error; negate it so that higher scores are better.
            mse = mean_squared_error(labels, predictions)
            yield self.make_result(result={"mse": mse}, score=-mse)

    This allows developers to benchmark numerical predictions from models and track changes over time.
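
    For context, the examples consumed by get_examples above pair a prompt with a numeric target. Below is a small sketch of how such a dataset might be written out as JSONL; the file name and the input/ideal field names mirror the snippets above and are assumptions, not a documented schema:

    import json

    # Hypothetical regression dataset: each record carries the prompt ("input")
    # and the numeric reference value ("ideal") that RegressionEval compares against.
    examples = [
        {"input": "Predict the house price in $1000s for 3 beds, 2 baths, 1500 sqft. Reply with a number only.", "ideal": 325.0},
        {"input": "Predict the house price in $1000s for 2 beds, 1 bath, 900 sqft. Reply with a number only.", "ideal": 180.0},
    ]

    with open("regression_examples.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")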

    Seamless Workflow Integration

    Whether you’re building a chatbot, summarization engine, or classification system, evaluations can now be triggered as part of your CI/CD pipeline. This ensures that every prompt or model update maintains or improves performance before going live.

    openai.evals.run(
      eval_name="my_eval",
      completion_fn="gpt-4",
      eval_config={"path": "eval_config.yaml"}
    )
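
    As one illustration of such a gate, the sketch below fails a CI job when the evaluation score drops below a fixed threshold. It assumes that openai.evals.run returns a report object exposing an aggregate score attribute; that return shape, the attribute name, and the threshold are assumptions made for illustration, not documented behavior:

    import sys
    import openai.evals

    # Hypothetical CI gate: run the configured eval and fail the pipeline on regressions.
    report = openai.evals.run(
        eval_name="my_eval",
        completion_fn="gpt-4",
        eval_config={"path": "eval_config.yaml"},
    )

    MIN_SCORE = 0.85  # illustrative quality bar; tune per application
    if report.score < MIN_SCORE:  # 'score' is an assumed attribute on the report
        print(f"Eval score {report.score:.3f} is below the {MIN_SCORE} threshold")
        sys.exit(1)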

    Conclusion

    The launch of the Evals API marks a shift toward robust, automated evaluation standards in LLM development. By offering the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continuously improve the quality of their AI applications.

    To explore further, check out the official OpenAI Evals documentation and the cookbook examples.

