
    OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers

    April 9, 2025

    In a significant move to empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new toolset that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

    Why the Evals API Matters

    Evaluating LLM performance has often been a manual, time-consuming process, especially for teams scaling applications across diverse domains. With the Evals API, OpenAI provides a systematic approach to:

    • Assess model performance on custom test cases (see the sample dataset after this list)
    • Measure improvements across prompt iterations
    • Automate quality assurance in development pipelines
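
    A custom test case is typically just an input prompt paired with a reference answer. As a hypothetical illustration (the input/ideal field names are assumptions chosen to match the code samples later in this article), a dataset in JSONL form might look like:

    {"input": "What is the capital of France?", "ideal": "Paris"}
    {"input": "What is 12 * 8?", "ideal": "96"}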

    Now, every developer can treat evaluation as a first-class citizen in the development cycle—similar to how unit tests are treated in traditional software engineering.

    Core Features of the Evals API

    1. Custom Eval Definitions: Developers can write their own evaluation logic by extending base classes.
    2. Test Data Integration: Seamlessly integrate evaluation datasets to test specific scenarios.
    3. Parameter Configuration: Configure model, temperature, max tokens, and other generation parameters.
    4. Automated Runs: Trigger evaluations via code, and retrieve results programmatically.

    The Evals API supports a YAML-based configuration structure, allowing for both flexibility and reusability.
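
    The exact schema is not spelled out in this overview, so the following is a minimal sketch of what such a configuration might contain; every key name here is an assumption for illustration, with the model and sampling parameters echoing the features listed above:

    # eval_config.yaml (hypothetical structure, for illustration only)
    eval_name: my_eval
    completion_fn: gpt-4
    generation:
      temperature: 0.0
      max_tokens: 256
    data:
      path: examples.jsonl
    record_path: eval_results.jsonl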

    Getting Started with the Evals API

    To use the Evals API, you first install the OpenAI Python package:

    pip install openai
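
    The client reads your credentials from the environment, so you will also want your API key exported before running anything:

    export OPENAI_API_KEY="sk-..."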

    Then, you can run an evaluation using a built-in eval, such as factuality_qna:

    oai evals registry:evaluation:factuality_qna \
      --completion_fns gpt-4 \
      --record_path eval_results.jsonl

    Or define a custom eval in Python:

    import openai.evals

    class MyRegressionEval(openai.evals.Eval):
        def run(self):
            # Iterate over the eval's dataset of test examples.
            for example in self.get_examples():
                # Query the configured model with the example's input.
                result = self.completion_fn(example['input'])
                # Score the model's output against the reference answer.
                score = self.compute_score(result, example['ideal'])
                yield self.make_result(result=result, score=score)

    This example shows how you can define custom evaluation logic—in this case, measuring regression accuracy.
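
    Note that compute_score is not implemented above. A minimal sketch, assuming a simple exact-match criterion (the method body is an illustration, not part of the API), could look like this:

    class MyRegressionEval(openai.evals.Eval):
        # ... run() as shown above ...

        def compute_score(self, result, ideal):
            # Hypothetical scorer: 1.0 for an exact string match with the
            # reference answer, 0.0 otherwise. A real eval would substitute
            # a task-appropriate metric here.
            return 1.0 if result.strip() == str(ideal).strip() else 0.0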

    Use Case: Regression Evaluation

    OpenAI’s cookbook example walks through building a regression evaluator using the API. Here’s a simplified version:

    import openai.evals
    from sklearn.metrics import mean_squared_error

    class RegressionEval(openai.evals.Eval):
        def run(self):
            predictions, labels = [], []
            for example in self.get_examples():
                # Ask the model for a numeric prediction and parse it.
                response = self.completion_fn(example['input'])
                predictions.append(float(response.strip()))
                labels.append(example['ideal'])
            # Aggregate into a single mean-squared-error metric; negate it
            # so that a higher score means better predictions.
            mse = mean_squared_error(labels, predictions)
            yield self.make_result(result={"mse": mse}, score=-mse)

    This allows developers to benchmark numerical predictions from models and track changes over time.
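
    Since this evaluator parses each completion with float(), the ideal values in the dataset must be numeric. A hypothetical example:

    {"input": "How many days are in a leap year? Answer with a number only.", "ideal": 366}
    {"input": "What is 15% of 200? Answer with a number only.", "ideal": 30}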

    Seamless Workflow Integration

    Whether you’re building a chatbot, summarization engine, or classification system, evaluations can now be triggered as part of your CI/CD pipeline. This ensures that every prompt or model update maintains or improves performance before going live.

    openai.evals.run(
      eval_name="my_eval",
      completion_fn="gpt-4",
      eval_config={"path": "eval_config.yaml"}
    )
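
    A minimal sketch of such a gate, here as a hypothetical GitHub Actions step (the run_evals.py wrapper script and its exit-code convention are assumptions, not part of the API):

    - name: Run prompt regression evals
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      run: |
        pip install openai
        # run_evals.py calls openai.evals.run(...) and exits non-zero
        # if the score regresses, failing the pipeline before deploy.
        python run_evals.py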

    Conclusion

    The launch of the Evals API marks a shift toward robust, automated evaluation standards in LLM development. By offering the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continuously improve the quality of their AI applications.

    To explore further, check out the official OpenAI Evals documentation and the cookbook examples.

    The post OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers appeared first on MarkTechPost.
