
    LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses

    April 21, 2025

As LLMs become more prominent in healthcare settings, ensuring that credible sources back their outputs is increasingly important. Although no LLMs are yet FDA-approved for clinical decision-making, top models such as GPT-4o, Claude, and Med-PaLM have outperformed clinicians on standardized exams like the USMLE. These models are already being used in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate, generating unverified or inaccurate statements, poses a serious risk, especially in medical contexts where misinformation can cause harm. This has become a major concern for clinicians, many of whom cite a lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators such as the FDA have also emphasized transparency and accountability, underscoring the need for reliable source attribution in medical AI tools.

Recent advances, such as instruction fine-tuning and retrieval-augmented generation (RAG), have enabled LLMs to produce sources when prompted. Yet even when the references point to legitimate websites, there is often little clarity on whether those sources truly support the model's claims. Prior research has introduced datasets such as WebGPT, ExpertQA, and HAGRID to assess LLM source attribution, but these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches use LLMs themselves to assess attribution quality, as demonstrated in works such as ALCE, AttributedQA, and FactScore. While tools like ChatGPT can assist in evaluating citation accuracy, studies show that such models still struggle to guarantee reliable attribution, highlighting the need for continued work in this area.
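As a concrete illustration of that LLM-as-judge pattern, the sketch below asks a model whether a cited source fully supports a given statement. The prompt wording, the gpt-4o model choice, and the use of the OpenAI client are illustrative assumptions, not the exact setup of any of the cited works.

```python
# Minimal LLM-as-judge sketch for source attribution, in the spirit of
# ALCE / AttributedQA-style evaluation. Prompt and model are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are verifying medical citations.
Statement: {statement}
Source text: {source_text}
Does the source text fully support the statement? Answer YES or NO."""

def source_supports(statement: str, source_text: str) -> bool:
    """Ask the judge model whether the cited source supports the statement."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any capable judge model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(statement=statement,
                                                  source_text=source_text)}],
        temperature=0,  # deterministic-ish judgments for evaluation
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```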

Researchers from Stanford University and other institutions have developed SourceCheckup, an automated tool that evaluates how reliably LLMs back their medical responses with relevant sources. Analyzing 800 questions and over 58,000 source-statement pairs, they found that 50% to 90% of LLM-generated answers were not fully supported by the sources they cited, with GPT-4 producing unsupported claims in about 30% of cases. Even LLMs with web access struggled to provide source-backed responses consistently. Validated by medical experts, SourceCheckup revealed significant gaps in the reliability of LLM-generated references, raising critical concerns about their readiness for use in clinical decision-making.

The study evaluated the source attribution performance of several top-performing proprietary and open-source LLMs using the SourceCheckup pipeline. The process involved generating 800 medical questions, half drawn from Reddit's r/AskDocs and half created by GPT-4o from Mayo Clinic texts, then assessing each LLM's responses for factual accuracy and citation quality. Responses were broken down into verifiable statements, matched with their cited sources, and scored by GPT-4 for support. The framework reported metrics, including URL validity and support, at both the statement and response levels. Medical experts validated all components, and the results were cross-verified using Claude Sonnet 3.5 to assess potential bias from GPT-4.
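The paper's actual code is not reproduced here, but a pipeline of the same shape, decompose a response into statements, validate cited URLs, judge support, and aggregate, could be sketched as follows. It reuses the hypothetical `source_supports` judge from the earlier sketch, and a naive sentence split stands in for the LLM-assisted statement parsing described above.

```python
# SourceCheckup-style pipeline sketch: helper names and logic are
# illustrative assumptions, not the authors' implementation.
import requests

def decompose_into_statements(response_text: str) -> list[str]:
    """Split a response into atomic statements (LLM-assisted in the paper;
    a naive sentence split stands in for it here)."""
    return [s.strip() for s in response_text.split(".") if s.strip()]

def fetch(url: str) -> str | None:
    """Return page text if the cited URL resolves, else None."""
    try:
        resp = requests.get(url, timeout=10)
        return resp.text if resp.status_code == 200 else None
    except requests.RequestException:
        return None

def check_response(response_text: str, citations: dict[str, str]) -> dict:
    """Score one response; `citations` maps each statement to its cited URL."""
    results = []
    for statement in decompose_into_statements(response_text):
        url = citations.get(statement)
        source_text = fetch(url) if url else None  # statement-level URL validity
        supported = source_text is not None and source_supports(statement,
                                                                source_text)
        results.append({"statement": statement,
                        "url_valid": source_text is not None,
                        "supported": supported})
    return {"statements": results,
            # Response-level support: every statement must be backed.
            "fully_supported": all(r["supported"] for r in results)}
```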

The evaluation offers a comprehensive picture of how well LLMs verify and cite medical sources. Human experts confirmed that the model-generated questions were relevant and answerable, and that the parsed statements closely matched the original responses. In source verification, the model's accuracy nearly matched that of expert doctors, with no statistically significant difference between model and expert judgments. Claude Sonnet 3.5 and GPT-4o demonstrated comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron significantly underperformed, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than the others thanks to its internet access, supported only 55% of its responses with reliable sources, and similar limitations were observed across all models.
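To see why response-level figures such as the 55% above sit well below statement-level ones, note that a response counts as supported only if every one of its statements is backed. A toy calculation with made-up numbers:

```python
# Toy illustration of statement-level vs. response-level support rates.
# The counts below are invented for the example, not the paper's data.
responses = [
    {"supported": 4, "total": 4},  # every claim backed -> fully supported
    {"supported": 3, "total": 4},  # one unsupported claim sinks the response
]

statement_rate = (sum(r["supported"] for r in responses)
                  / sum(r["total"] for r in responses))          # 7/8 = 88%
response_rate = (sum(r["supported"] == r["total"] for r in responses)
                 / len(responses))                               # 1/2 = 50%

print(f"statement-level support: {statement_rate:.0%}")
print(f"response-level support:  {response_rate:.0%}")
```

A single stray claim is enough to fail a whole response, which is why the response-level metric is the stricter, more clinically relevant one.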

    The findings underscore persistent challenges in ensuring factual accuracy in LLM responses to open-ended medical queries. Many models, even those enhanced with retrieval, failed to consistently link claims to credible evidence, particularly for questions from community platforms like Reddit, which tend to be more ambiguous. Human evaluations and SourceCheckup assessments consistently revealed low response-level support rates, highlighting a gap between current model capabilities and the standards needed in clinical contexts. To improve trustworthiness, the study suggests models should be trained or fine-tuned explicitly for accurate citation and verification. Additionally, automated tools like SourceCleanup demonstrated promise in editing unsupported statements to improve factual grounding, offering a scalable path to enhance citation reliability in LLM outputs.
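A SourceCleanup-style fix-up pass can likewise be sketched as a single rewrite prompt: given an unsupported statement and its cited source, ask a model to restate only what the source actually says. The prompt and model below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of rewriting an unsupported statement to match its source.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = """The following statement is not fully supported by its
cited source. Rewrite it so that every claim is grounded in the source text,
removing anything the source does not say.
Statement: {statement}
Source text: {source_text}
Rewritten statement:"""

def cleanup_statement(statement: str, source_text: str) -> str:
    """Return a version of the statement grounded in the cited source."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(statement=statement,
                                                    source_text=source_text)}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()
```

Re-running the support check on the rewritten statement then closes the loop: edit, re-verify, and keep only statements that pass.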


Check out the Paper.


    The post LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses appeared first on MarkTechPost.
