The primary goal of AI is to create interactive systems capable of solving diverse problems, including those in medical AI aimed at improving patient outcomes. Large language models (LLMs) have demonstrated significant problem-solving abilities, surpassing human scores on exams like the USMLE. While LLMs can enhance healthcare accessibility, they still face limitations in real-world clinical settings because clinical work involves sequential decision-making, handling uncertainty, and compassionate patient care. Current evaluations rely mostly on static multiple-choice questions, which do not capture this dynamic nature of clinical work.
The USMLE assesses medical students on foundational knowledge, clinical application, and independent practice skills. In contrast, the Objective Structured Clinical Examination (OSCE) evaluates practical clinical skills through simulated scenarios, offering direct observation and a more comprehensive assessment. Language models in medicine are primarily evaluated on knowledge-based benchmarks like MedQA, which consists of challenging medical question-answering pairs. Recent efforts refine language models’ applications in healthcare through red teaming and new benchmarks such as EquityMedQA that address biases and improve evaluation methods. In addition, advances in clinical decision-making simulations, such as AMIE, show promise for improving diagnostic accuracy in medical AI.
Researchers from Stanford University, Johns Hopkins University, and Hospital Israelita Albert Einstein present AgentClinic, an open-source benchmark that simulates clinical environments using language agents in patient, doctor, and measurement roles. It extends previous simulations by letting the doctor perform medical exams (e.g., temperature, blood pressure) and order medical images (e.g., MRI, X-ray) through dialogue. AgentClinic also supports 24 biases found in clinical settings.
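To make the setup concrete, here is a minimal sketch of how such a dialogue loop could be wired, assuming a generic `llm_chat(system_prompt, history)` helper that wraps whichever model is being tested; the role prompts, the `REQUEST TEST:` convention, and the stopping rule are illustrative assumptions rather than AgentClinic's actual implementation.

```python
# Hypothetical sketch of a multi-agent clinical dialogue loop, not AgentClinic's actual code.

def llm_chat(system_prompt: str, history: list[str]) -> str:
    """Stub for a chat-model call; swap in whichever LLM API is being benchmarked."""
    raise NotImplementedError

DOCTOR_SYS = ("You are a doctor interviewing a patient. Ask questions, request tests with "
              "'REQUEST TEST: <name>', and finish with 'DIAGNOSIS READY: <diagnosis>'.")
PATIENT_SYS = "You are a patient with the symptoms below. Answer questions, but never state the diagnosis.\n{case}"
MEASUREMENT_SYS = "You report vital signs and test results for this case when asked.\n{results}"

def run_scenario(case: dict, max_turns: int = 20) -> str:
    """Run one simulated encounter and return the doctor agent's final diagnosis ('' on timeout)."""
    history: list[str] = []
    for _ in range(max_turns):
        doctor_msg = llm_chat(DOCTOR_SYS, history)
        history.append(f"Doctor: {doctor_msg}")
        if "DIAGNOSIS READY:" in doctor_msg:        # the doctor commits to a diagnosis
            return doctor_msg.split("DIAGNOSIS READY:", 1)[1].strip()
        if "REQUEST TEST:" in doctor_msg:           # route test requests to the measurement agent
            reply = llm_chat(MEASUREMENT_SYS.format(results=case["test_results"]), history)
            history.append(f"Measurement: {reply}")
        else:                                       # otherwise the patient agent answers
            reply = llm_chat(PATIENT_SYS.format(case=case["symptoms"]), history)
            history.append(f"Patient: {reply}")
    return ""
```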
AgentClinic introduces four language agents: patient, doctor, measurement, and moderator. Each agent has a specific role and its own private information for simulating clinical interactions. The patient agent describes symptoms without knowing the diagnosis, the measurement agent provides medical readings and test results, the doctor agent evaluates the patient and requests tests, and the moderator assesses the doctor’s final diagnosis. The 24 biases noted above can be introduced into these agents to study their effect on the interaction. The agents are built from curated USMLE medical questions and NEJM case challenges, which are turned into structured scenarios for evaluating language models such as GPT-4.
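One plausible way to realize these biases is as prompt modifiers appended to an agent's instructions. The sketch below is illustrative only: the bias names and wording are invented for this example, and `PATIENT_SYS` is reused from the loop sketched earlier.

```python
# Illustrative only: the bias wording below is made up, not taken from AgentClinic's prompts.

BIAS_PROMPTS = {
    "recency": "You recently saw several similar patients who all had influenza, and this weighs heavily on your reasoning.",
    "self_diagnosis": "You are convinced you already know your diagnosis from an internet search and resist other explanations.",
}

def with_bias(system_prompt: str, bias: str | None) -> str:
    """Return the agent's system prompt, optionally extended with a bias instruction."""
    if bias is None:
        return system_prompt
    return f"{system_prompt}\n\nBias: {BIAS_PROMPTS[bias]}"

# e.g., a biased patient prompt for the loop sketched above:
biased_patient_sys = with_bias(PATIENT_SYS, "self_diagnosis")
```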
The accuracy of different language models (GPT-4, Mixtral-8x7B, GPT-3.5, and Llama 2 70B-chat) is evaluated on AgentClinic-MedQA, where each model acts as the doctor agent diagnosing patients through dialogue. GPT-4 achieved the highest accuracy at 52%, followed by GPT-3.5 at 38%, Mixtral-8x7B at 37%, and Llama 2 70B-chat at 9%. MedQA accuracy proved to be only a weak predictor of AgentClinic-MedQA accuracy, echoing studies relating medical residents’ performance to their USMLE scores.
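Scoring a model then amounts to running every scenario with that model as the doctor agent and letting a moderator judge the final diagnosis. The hedged sketch below reuses `run_scenario` and `llm_chat` from above; the `correct_diagnosis` field and the moderator prompt are assumptions for illustration.

```python
# Hedged sketch: score a model acting as the doctor agent, reusing the helpers above.

MODERATOR_SYS = ("You are a moderator. Decide whether the proposed diagnosis matches the "
                 "reference diagnosis. Answer only 'yes' or 'no'.")

def evaluate(cases: list[dict], max_turns: int = 20) -> float:
    """Fraction of scenarios in which the moderator judges the doctor's final diagnosis correct."""
    correct = 0
    for case in cases:
        predicted = run_scenario(case, max_turns=max_turns)
        verdict = llm_chat(
            MODERATOR_SYS,
            [f"Reference diagnosis: {case['correct_diagnosis']}",
             f"Proposed diagnosis: {predicted}"],
        )
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(cases)
```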
To recapitulate, the researchers present AgentClinic, a benchmark for simulating clinical environments with 15 multimodal language agents based on NEJM case challenges and 107 unique language agents based on USMLE cases. These agents can exhibit the clinically motivated biases described above, which impact diagnostic accuracy and patient-doctor interactions. GPT-4, the highest-performing model, shows modest accuracy reductions with cognitive biases (1.7%-2%) and implicit biases (1.5%), while the biases also reduce patients’ willingness to follow up and their confidence in the doctor. Cross-communication between patient and doctor models improves accuracy. Limited or excessive interaction time decreases accuracy, with a 27% reduction at N=10 interactions and a 4%-9% reduction when N exceeds 20. GPT-4V achieves around 27% accuracy in the multimodal clinical environment based on NEJM cases.
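The interaction-length effect can be probed by sweeping the turn budget used in the earlier sketches; the budget values below are arbitrary examples, not the paper's settings.

```python
# Probe sensitivity to interaction length by sweeping the dialogue turn budget.
def sweep_turn_budget(cases: list[dict], budgets=(10, 15, 20, 25, 30)) -> dict[int, float]:
    """Accuracy at each turn budget, to see where too little or too much dialogue hurts."""
    return {n: evaluate(cases, max_turns=n) for n in budgets}
```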
Check out the Paper and Project. All credit for this research goes to the researchers of this project.