
    Claude 3 Opus blows all LLMs away in book-length summarization

    April 8, 2024

    Researchers published a study comparing the accuracy and quality of summaries that LLMs produce. Claude 3 Opus performed particularly well but humans still have the edge.

    AI models are extremely useful for summarizing long documents when you don’t have the time or inclination to read them.

    Growing context windows let us prompt models with longer documents, but longer inputs make it harder for a model to keep every fact straight in its summary.

    Researchers from the University of Massachusetts Amherst, Adobe, the Allen Institute for AI, and Princeton University published a study that sought to find out how good AI models are at summarizing book-length content (>100k tokens).

    FABLES

    They selected 26 books published in 2023 and 2024 and had various LLMs summarize the texts. The recent publication dates were chosen to avoid potential data contamination in the models’ original training data.

    Once the models produced the summaries, they used GPT-4 to extract decontextualized claims from them. The researchers then hired human annotators who had read the books and asked them to fact-check the claims.

    The LLM summarizes the book, GPT-4 extracts the claims, and human annotators verify the claims. Source: arXiv

    The resulting data was compiled into a dataset called “Faithfulness Annotations for Book-Length Summarization” (FABLES). FABLES contains 3,158 claim-level annotations of faithfulness across 26 narrative texts.

    The test results showed that Claude 3 Opus was “the most faithful book-length summarizer by a significant margin,” with over 90% of its claims verified as faithful, or accurate.

    GPT-4 came a distant second with only 78% of its claims verified as faithful by the human annotators.

    Percentage of claims extracted from LLM-generated summaries rated by humans as faithful, unfaithful, partial support or can’t verify. Source: arXiv
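The per-model percentages reported above come down to counting claim-level labels. The following is a minimal Python sketch of that arithmetic, assuming each annotation is reduced to a (model, label) pair. The function name, the model names, and the toy data are hypothetical and are not taken from FABLES.

```python
from collections import Counter

# The four rating categories named in the figure caption.
LABELS = ("faithful", "unfaithful", "partial support", "can't verify")


def label_percentages(annotations):
    """Return {model: {label: percent}} from an iterable of (model, label) pairs."""
    counts = {}
    for model, label in annotations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts.setdefault(model, Counter())[label] += 1

    result = {}
    for model, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for labels that never occurred for this model.
        result[model] = {label: 100.0 * counter[label] / total for label in LABELS}
    return result


# Hypothetical toy annotations for illustration only (not real study data).
toy_annotations = (
    [("model_a", "faithful")] * 9
    + [("model_a", "unfaithful")]
    + [("model_b", "faithful")] * 3
    + [("model_b", "partial support")]
)

if __name__ == "__main__":
    for model, shares in label_percentages(toy_annotations).items():
        print(model, shares)
```

Each model's four percentages sum to 100, so a model's faithful share can only be compared fairly alongside its "partial support" and "can't verify" shares.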

    The hard part

    The models under test all seemed to struggle with the same things. The majority of the facts the models got wrong related to events or states of characters and relationships.

    The paper noted that “most of these claims can only be invalidated via multi-hop reasoning over the evidence, highlighting the task’s complexity and its difference from existing fact-verification settings.”

    The LLMs also frequently left out critical information in their summaries. They over-emphasized content towards the end of books and neglected important content nearer the beginning.

    Will AI replace human annotators?

    Human annotators or fact-checkers are expensive. The researchers spent $5,200 to have the human annotators verify the claims in the AI summaries.

    Could an AI model have done the job for less? Simple fact retrieval is something Claude 3 is good at, but its performance when verifying claims that require a deeper understanding of the content is less consistent.

    When presented with the extracted claims and prompted to verify them, all the AI models fell short of human annotators. They performed particularly badly at identifying unfaithful claims.

    Even though Claude 3 Opus was the best claim verifier by some distance, the researchers concluded it “ultimately performs too poorly to be a reliable auto-rater.”
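One way to see why weak detection of unfaithful claims matters is to score an auto-rater per class instead of by overall accuracy. The sketch below is an illustration under assumptions, not the paper's evaluation method. The function names and the toy labels are hypothetical. When most claims are faithful, a rater that labels nearly everything "faithful" can look accurate overall while still missing half of the unfaithful claims.

```python
def accuracy(human, predicted):
    """Fraction of claims where the auto-rater agrees with the human label."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def per_class_recall(human, predicted):
    """For each human label, the fraction of those claims the auto-rater matched."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    totals, hits = {}, {}
    for h, p in zip(human, predicted):
        totals[h] = totals.get(h, 0) + 1
        if h == p:
            hits[h] = hits.get(h, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


# Hypothetical toy labels (not study data): 8 faithful and 2 unfaithful claims.
human_labels = ["faithful"] * 8 + ["unfaithful"] * 2
# A hypothetical auto-rater that misses one of the two unfaithful claims.
rater_labels = ["faithful"] * 8 + ["faithful", "unfaithful"]

if __name__ == "__main__":
    print("accuracy:", accuracy(human_labels, rater_labels))
    print("recall by class:", per_class_recall(human_labels, rater_labels))
```

In this toy case, overall agreement is 0.9 while recall on unfaithful claims is only 0.5, which is the kind of gap a single accuracy figure would hide.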

    When it comes to understanding the nuances, complex human relationships, plot points, and character motivations in a long narrative, it seems humans still have the edge for now.

    The post Claude 3 Opus blows all LLMs away in book-length summarization appeared first on DailyAI.
