
    OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers

    April 9, 2025

    In a significant move to empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new toolset that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

    Why the Evals API Matters

    Evaluating LLM performance has often been a manual, time-consuming process, especially for teams scaling applications across diverse domains. With the Evals API, OpenAI provides a systematic approach to:

    • Assess model performance on custom test cases
    • Measure improvements across prompt iterations
    • Automate quality assurance in development pipelines

    Now, every developer can treat evaluation as a first-class citizen in the development cycle, much as unit tests are treated in traditional software engineering.
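
    To make the analogy concrete, here is a minimal sketch of an evaluation written as an ordinary unit test with the standard OpenAI Python client; the prompt, model choice, and pass criterion are illustrative assumptions, not part of the Evals API itself.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def test_factual_answer():
        # One "test case": a prompt plus an ideal answer to check against.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": "What is the capital of France? Answer in one word."}],
        )
        answer = response.choices[0].message.content.strip()
        assert "Paris" in answer  # pass/fail, just like a unit test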

    Core Features of the Evals API

    1. Custom Eval Definitions: Developers can write their own evaluation logic by extending base classes.
    2. Test Data Integration: Seamlessly integrate evaluation datasets to test specific scenarios.
    3. Parameter Configuration: Configure model, temperature, max tokens, and other generation parameters.
    4. Automated Runs: Trigger evaluations via code, and retrieve results programmatically.

    The Evals API supports a YAML-based configuration structure, allowing for both flexibility and reusability.
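
    The article does not show a concrete schema, but a configuration along these lines would capture the features above; every field name in this sketch is a hypothetical illustration, not a documented format.

    # eval_config.yaml (hypothetical sketch)
    eval_name: my_eval
    completion_fn: gpt-4
    model_params:
      temperature: 0.0
      max_tokens: 256
    dataset:
      path: examples.jsonl   # one JSON record per line with input/ideal fields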

    Getting Started with the Evals API

    To use the Evals API, you first install the OpenAI Python package:

    pip install openai

    Then you can run an evaluation using a built-in eval, such as factuality_qna:

    oai evals registry:evaluation:factuality_qna \
      --completion_fns gpt-4 \
      --record_path eval_results.jsonl
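
    The run records per-example results to eval_results.jsonl. Assuming one JSON object per line, a quick way to inspect them is:

    import json

    # Print each recorded result; the exact record fields depend on the eval.
    with open("eval_results.jsonl") as f:
        for line in f:
            print(json.loads(line))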

    Or define a custom eval in Python:

    import openai.evals

    class MyRegressionEval(openai.evals.Eval):
        def run(self):
            # Iterate over the eval dataset; each example pairs an input
            # prompt with an ideal (reference) answer.
            for example in self.get_examples():
                result = self.completion_fn(example['input'])
                # compute_score is assumed to be supplied by the subclass to
                # compare the completion against the ideal answer.
                score = self.compute_score(result, example['ideal'])
                yield self.make_result(result=result, score=score)

    This example shows how you can define custom evaluation logic, in this case measuring regression accuracy.
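
    A driver for such a class might look like the following; the constructor arguments are assumptions based on the interface the article sketches, not a documented signature.

    # Hypothetical driver for the custom eval above.
    my_eval = MyRegressionEval(
        completion_fn="gpt-4",           # model used to produce completions
        examples_path="examples.jsonl",  # dataset of input/ideal records
    )
    for result in my_eval.run():
        print(result)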

    Use Case: Regression Evaluation

    OpenAI’s cookbook example walks through building a regression evaluator using the API. Here’s a simplified version:

    import openai.evals
    from sklearn.metrics import mean_squared_error

    class RegressionEval(openai.evals.Eval):
        def run(self):
            predictions, labels = [], []
            for example in self.get_examples():
                # Ask the model for a numeric prediction and parse it.
                response = self.completion_fn(example['input'])
                predictions.append(float(response.strip()))
                labels.append(example['ideal'])
            # Lower MSE is better, so negate it to report a higher-is-better score.
            mse = mean_squared_error(labels, predictions)
            yield self.make_result(result={"mse": mse}, score=-mse)

    This allows developers to benchmark numerical predictions from models and track changes over time.
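
    For reference, the examples consumed by get_examples() here would be numeric question/answer pairs. A hypothetical examples.jsonl could look like this, with the input and ideal fields matching the code above:

    {"input": "A train travels 120 km in 1.5 hours. What is its average speed in km/h? Reply with a number only.", "ideal": 80.0}
    {"input": "What is 17.5% of 200? Reply with a number only.", "ideal": 35.0}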

    Seamless Workflow Integration

    Whether you’re building a chatbot, summarization engine, or classification system, evaluations can now be triggered as part of your CI/CD pipeline. This ensures that every prompt or model update maintains or improves performance before going live.

    import openai.evals

    # Trigger a configured eval programmatically, e.g. from a CI/CD step.
    openai.evals.run(
        eval_name="my_eval",
        completion_fn="gpt-4",
        eval_config={"path": "eval_config.yaml"}
    )

    Conclusion

    The launch of the Evals API marks a shift toward robust, automated evaluation standards in LLM development. By offering the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continuously improve the quality of their AI applications.

    To explore further, check out the official OpenAI Evals documentation and the cookbook examples.

    The post OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers appeared first on MarkTechPost.
