    OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in Healthcare

    May 13, 2025

    OpenAI has released HealthBench, an open-source evaluation framework designed to measure the performance and safety of large language models (LLMs) in realistic healthcare scenarios. Developed in collaboration with 262 physicians across 60 countries and 26 medical specialties, HealthBench addresses the limitations of existing benchmarks by focusing on real-world applicability, expert validation, and diagnostic coverage.

    Addressing Benchmarking Gaps in Healthcare AI

    Existing benchmarks for healthcare AI typically rely on narrow, structured formats such as multiple-choice exams. While useful for initial assessments, these formats fail to capture the complexity and nuance of real-world clinical interactions. HealthBench shifts toward a more representative evaluation paradigm, incorporating 5,000 multi-turn conversations between models and either lay users or healthcare professionals. Each conversation ends with a user prompt, and model responses are assessed using example-specific rubrics written by physicians.
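To make the format concrete, here is a minimal sketch of what a single HealthBench-style example might look like. The field names and values are illustrative assumptions for clarity, not the actual schema of the released dataset.

```python
# Illustrative structure of one HealthBench-style example. Field names and
# values are assumptions chosen for clarity, not the released dataset schema.
example = {
    "prompt": [  # multi-turn conversation that ends with a user message
        {"role": "user",
         "content": "My father is 67 and suddenly has slurred speech. What should we do?"},
    ],
    "rubric": [  # example-specific criteria written by physicians
        {"criterion": "Advises calling emergency services immediately", "points": 10},
        {"criterion": "Mentions stroke as a possible cause", "points": 5},
        {"criterion": "Suggests waiting to see if symptoms resolve", "points": -8},  # negative criterion
    ],
    "theme": "emergency_referrals",
}
```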

    Each rubric consists of clearly defined criteria—positive and negative—with associated point values. These criteria capture behavioral attributes such as clinical accuracy, communication clarity, completeness, and instruction adherence. HealthBench evaluates over 48,000 unique criteria, with scoring handled by a model-based grader validated against expert judgment.
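As a rough illustration of how such rubric-based scoring could work (a simplified sketch based on the description above, not OpenAI's grading code), a response's score can be computed as the points earned on met criteria divided by the maximum achievable positive points, clipped to the range [0, 1]:

```python
def rubric_score(rubric, criteria_met):
    """Score one response against its rubric (illustrative sketch).

    rubric: list of {"criterion": str, "points": int}; points may be negative.
    criteria_met: list of booleans from the model-based grader, one per criterion.
    Returns a score in [0, 1]: points earned over maximum possible positive points.
    """
    earned = sum(c["points"] for c, met in zip(rubric, criteria_met) if met)
    max_points = sum(c["points"] for c in rubric if c["points"] > 0)
    if max_points == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))


# Using the illustrative rubric above: the response advised emergency care and
# mentioned stroke, and did not suggest waiting.
score = rubric_score(
    [{"criterion": "emergency care", "points": 10},
     {"criterion": "mentions stroke", "points": 5},
     {"criterion": "suggests waiting", "points": -8}],
    [True, True, False],
)
print(score)  # 1.0
```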

    Benchmark Structure and Design

    HealthBench organizes its evaluation across seven key themes: emergency referrals, global health, health data tasks, context-seeking, expertise-tailored communication, response depth, and responding under uncertainty. Each theme represents a distinct real-world challenge in medical decision-making and user interaction.

    In addition to the standard benchmark, OpenAI introduces two variants:

    • HealthBench Consensus: A subset emphasizing 34 physician-validated criteria, designed to reflect critical aspects of model behavior such as advising emergency care or seeking additional context.
    • HealthBench Hard: A more difficult subset of 1,000 conversations selected for their ability to challenge current frontier models.

    These components allow for detailed stratification of model behavior by both conversation type and evaluation axis, offering more granular insights into model capabilities and shortcomings.
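This stratified reporting amounts to aggregating per-example scores by theme (and, analogously, by evaluation axis). A minimal sketch, assuming each scored example carries a theme label:

```python
from collections import defaultdict

def scores_by_theme(results):
    """Average per-example scores grouped by theme (illustrative sketch).

    results: iterable of {"theme": str, "score": float}
    Returns {theme: mean_score}.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for r in results:
        totals[r["theme"]] += r["score"]
        counts[r["theme"]] += 1
    return {theme: totals[theme] / counts[theme] for theme in totals}
```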

    Evaluation of Model Performance

    OpenAI evaluated several models on HealthBench, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the newer o3 model. Results show marked progress: GPT-3.5 achieved 16%, GPT-4o reached 32%, and o3 attained 60% overall. Notably, GPT-4.1 nano, a smaller and cost-effective model, outperformed GPT-4o while reducing inference cost by a factor of 25.

Performance varied by theme and evaluation axis. Emergency referrals and tailored communication were areas of relative strength, while context-seeking and completeness posed greater challenges. A detailed breakdown revealed that completeness was the axis most strongly correlated with overall score, underscoring its importance in health-related tasks.

    OpenAI also compared model outputs with physician-written responses. Unassisted physicians generally produced lower-scoring responses than models, although they could improve model-generated drafts, particularly when working with earlier model versions. These findings suggest a potential role for LLMs as collaborative tools in clinical documentation and decision support.

    Reliability and Meta-Evaluation

HealthBench includes mechanisms to assess model consistency. The "worst-at-k" metric captures worst-case behavior by measuring how much a model's score degrades when only the worst of k repeated runs is counted. While newer models showed improved stability, variability remains an area for ongoing research.
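One simple way to estimate such a metric (a sketch of the idea, not necessarily the exact formulation used in the paper): given overall scores from n independent runs, average the minimum score over every subset of k runs.

```python
from itertools import combinations

def worst_at_k(run_scores, k):
    """Estimate worst-at-k from overall scores across repeated runs (sketch).

    run_scores: overall benchmark scores from n independent runs of the same model.
    k: number of runs considered at a time.
    Returns the average of the minimum score over all size-k subsets of runs.
    """
    subsets = list(combinations(run_scores, k))
    return sum(min(s) for s in subsets) / len(subsets)


# Example: a model scoring [0.61, 0.58, 0.60, 0.55] over four runs.
print(worst_at_k([0.61, 0.58, 0.60, 0.55], k=2))  # about 0.568, below the 0.585 mean
```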

    To assess the trustworthiness of its automated grader, OpenAI conducted a meta-evaluation using over 60,000 annotated examples. GPT-4.1, used as the default grader, matched or exceeded the average performance of individual physicians in most themes, suggesting its utility as a consistent evaluator.
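One simple way to frame such a meta-evaluation (an illustrative sketch, not the paper's exact protocol) is to compare the grader's binary criterion judgments against physician annotations on the same examples and report their agreement rate:

```python
def grader_agreement(grader_labels, physician_labels):
    """Fraction of rubric-criterion judgments where the model-based grader and a
    physician annotator agree (both True or both False). Illustrative sketch."""
    assert len(grader_labels) == len(physician_labels)
    matches = sum(g == p for g, p in zip(grader_labels, physician_labels))
    return matches / len(grader_labels)


# Example: agreement on five criterion judgments.
print(grader_agreement([True, False, True, True, False],
                       [True, False, False, True, False]))  # 0.8
```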

    Conclusion

    HealthBench represents a technically rigorous and scalable framework for assessing AI model performance in complex healthcare contexts. By combining realistic interactions, detailed rubrics, and expert validation, it offers a more nuanced picture of model behavior than existing alternatives. OpenAI has released HealthBench via the simple-evals GitHub repository, providing researchers with tools to benchmark, analyze, and improve models intended for health-related applications.


Check out the Paper, GitHub Page, and Official Release. All credit for this research goes to the researchers of this project.
