Claude 3 Opus blows all LLMs away in book-length summarization

Researchers published a study comparing the accuracy and quality of summaries that LLMs produce. Claude 3 Opus performed particularly well but humans still have the edge.

AI models are extremely useful for summarizing long documents when you don’t have the time or inclination to read them.

The luxury of growing context windows means we get to prompt models with longer documents, which challenges their ability to always get the facts straight in the summary.

Researchers from the University of Massachusetts Amherst, Adobe, the Allen Institute for AI, and Princeton University published a study that set out to measure how well AI models summarize book-length content (>100k tokens).

FABLES

They selected 26 books published in 2023 and 2024 and had various LLMs summarize the texts. The recent publication dates were chosen to avoid potential data contamination in the models’ original training data.

Once the models produced the summaries, they used GPT-4 to extract decontextualized claims from them. The researchers then hired human annotators who had read the books and asked them to fact-check the claims.

The LLM summarizes the book, GPT-4 extracts the claims, and human annotators verify the claims. Source: arXiv

The resulting data was compiled into a dataset called “Faithfulness Annotations for Book-Length Summarization” (FABLES). FABLES contains 3,158 claim-level annotations of faithfulness across 26 narrative texts.

The test results showed that Claude 3 Opus was “the most faithful book-length summarizer by a significant margin,” with over 90% of its claims verified as faithful, or accurate.

GPT-4 came a distant second with only 78% of its claims verified as faithful by the human annotators.

Percentage of claims extracted from LLM-generated summaries rated by humans as faithful, unfaithful, partial support or can’t verify. Source: arXiv
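As a rough illustration of how per-model percentages like these can be derived from claim-level annotations, here is a minimal Python sketch. The function name `label_shares`, the `(model, label)` record layout, and the toy data are assumptions made for this example; they are not the FABLES file format or the researchers' code.

```python
from collections import Counter

# The four rating categories named in the figure caption.
LABELS = ("faithful", "unfaithful", "partial support", "can't verify")


def label_shares(annotations):
    """Return, for each model, the percentage of its claims under each label.

    `annotations` is an iterable of (model_name, label) pairs, one per claim.
    This layout is a hypothetical stand-in for real annotation records.
    """
    counts = {}
    for model, label in annotations:
        counts.setdefault(model, Counter())[label] += 1

    shares = {}
    for model, counter in counts.items():
        total = sum(counter.values())
        shares[model] = {
            label: 100.0 * counter[label] / total for label in LABELS
        }
    return shares


# Toy data invented for illustration only; these are NOT FABLES results.
toy = [
    ("model_a", "faithful"),
    ("model_a", "faithful"),
    ("model_a", "unfaithful"),
    ("model_a", "partial support"),
    ("model_b", "faithful"),
    ("model_b", "can't verify"),
]

shares = label_shares(toy)
print(shares["model_a"]["faithful"])  # 50.0 (2 of 4 toy claims)
```

Because every claim carries exactly one label, the four percentages for a model sum to 100, which is what lets a stacked chart compare models side by side.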

The hard part

The models under test all seemed to struggle with the same things. The majority of the facts the models got wrong related to events or states of characters and relationships.

The paper noted that “most of these claims can only be invalidated via multi-hop reasoning over the evidence, highlighting the task’s complexity and its difference from existing fact-verification settings.”

The LLMs also frequently left out critical information in their summaries. They over-emphasized content towards the end of books and missed important content nearer the beginning.

Will AI replace human annotators?

Human annotators or fact-checkers are expensive. The researchers spent $5,200 to have the human annotators verify the claims in the AI summaries.

Could an AI model have done the job for less? Simple fact retrieval is something Claude 3 is good at, but its performance when verifying claims that require a deeper understanding of the content is less consistent.

When presented with the extracted claims and prompted to verify them, all the AI models fell short of human annotators. They performed particularly badly at identifying unfaithful claims.

Even though Claude 3 Opus was the best claim verifier by some distance, the researchers concluded it “ultimately performs too poorly to be a reliable auto-rater.”
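One way to see how an auto-rater can look acceptable overall and still fail on unfaithful claims is to score each label separately against the human verdicts. The sketch below uses standard precision, recall, and F1. The function name and the toy labels are assumptions for illustration, and the article does not say which metrics the researchers used.

```python
def per_label_scores(human, model, label):
    """Precision, recall and F1 for one label, with human verdicts as ground truth."""
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == label and m == label)
    fp = sum(1 for h, m in pairs if h != label and m == label)
    fn = sum(1 for h, m in pairs if h == label and m != label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy verdicts invented for illustration only; these are NOT results from the study.
human = ["faithful", "faithful", "unfaithful", "unfaithful", "faithful", "unfaithful"]
model = ["faithful", "faithful", "faithful", "unfaithful", "faithful", "faithful"]

precision, recall, f1 = per_label_scores(human, model, "unfaithful")
print(f"unfaithful: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the hypothetical rater agrees with the humans on four of six claims, yet it catches only one of the three unfaithful ones. A rater that leans towards answering "faithful" is exactly the kind that a single overall accuracy figure would flatter.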

When it comes to understanding the nuances, complex human relationships, plot points, and character motivations in a long narrative, it seems humans still have the edge for now.

The post Claude 3 Opus blows all LLMs away in book-length summarization appeared first on DailyAI.
