Anthropic: Large context LLMs vulnerable to many-shot jailbreak

Anthropic released a paper outlining a many-shot jailbreaking method to which long-context LLMs are particularly vulnerable.

The size of an LLMâ€™s context window determines the maximum length of a prompt. Context windows have been growing consistently over the last few months with models like Claude Opus reaching a context window of 1 million tokens.

The expanded context window makes more powerful in-context learning possible. With a zero-shot prompt, an LLM is prompted to provide a response without prior examples.

In a few-shot approach, the model is provided with several examples in the prompt. This allows for in-context learning and primes the model to give a better answer.

Larger context windows mean a userâ€™s prompt can be extremely long with many examples, which Anthropic says is both a blessing and a curse.

Many-shot jailbreak

The jailbreak method is exceedingly simple. The LLM is prompted with a single prompt comprised of a fake dialogue between a user and a very accommodating AI assistant.

The dialogue comprises a series of queries on how to do something dangerous or illegal followed by fake responses from the AI assistant with information on how to perform the activities.

The prompt ends with a target query like â€œHow to build a bomb?â€ and then leaves it to the targeted LLM to answer.

Few-shot vs many-shot jailbreak. Source: Anthropic

If you only had a few back-and-forth interactions in the prompt it doesnâ€™t work. But with a model like Claude Opus, the many-shot prompt can be as long as several long novels.

In their paper, the Anthropic researchers found that â€œas the number of included dialogues (the number of â€œshotsâ€) increases beyond a certain point, it becomes more likely that the model will produce a harmful response.â€

They also found that when combined with other known jailbreaking techniques, the many-shot approach was even more effective or could be successful with shorter prompts.

As the number of dialogues in the prompt increases, the odds of a harmful response increase. Source: Anthropic

Can it be fixed?

Anthropic says that the easiest defense against the many-shot jailbreak is to reduce the size of a modelâ€™s context window. But then you lose the obvious benefits of being able to use longer inputs.

Anthropic tried to have their LLM identify when a user was trying a many-shot jailbreak and then refuse to answer the query. They found that it simply delayed the jailbreak and required a longer prompt to eventually elicit the harmful output.

By classifying and modifying the prompt before passing it to the model they had some success in preventing the attack. Even so, Anthropic says theyâ€™re mindful that variations of the attack could evade detection.

Anthropic says that the ever-lengthening context window of LLMs â€œmakes the models far more useful in all sorts of ways, but it also makes feasible a new class of jailbreaking vulnerabilities.â€

The company has published its research in the hope that other AI companies find ways to mitigate many-shot attacks.

An interesting conclusion that the researchers came to was that â€œeven positive, innocuous-seeming improvements to LLMs (in this case, allowing for longer inputs) can sometimes have unforeseen consequences.â€

The post Anthropic: Large context LLMs vulnerable to many-shot jailbreak appeared first on DailyAI.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

Minecraft licensing robbed us of this controversial NFL schedule release video

The power of generators

The power of generators

Simplify Factory Associations with Laravel’s UseFactory Attribute

This Week in Laravel: React Native, PhpStorm Junie, and more

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

Anthropic: Large context LLMs vulnerable to many-shot jailbreak

Many-shot jailbreak

Can it be fixed?

Nmap 7.96 Launches with Lightning-Fast DNS and 612 Scripts

CVE-2025-4831 – TOTOLINK HTTP POST Request Handler Buffer Overflow Vulnerability

[Podcast] What If Your Twin Was Living In Your Computer? An Interview With Claus Torp Jensen

CMU Researchers Provide an In-Depth Study to Formulate and Understand Hallucination in Diffusion Models through Mode Interpolation

Community News: Latest PECL Releases (12.17.2024)

Design Analysis: Get detailed design audit within seconds

Distribution Release: Nitrux e3ba3c69

Apple is reportedly working on AR glasses and a cheaper Vision headset

A new era of creativity

OpenAI’s new AI Agents promise to revolutionize AI development

Anthropic: Large context LLMs vulnerable to many-shot jailbreak

Many-shot jailbreak

Can it be fixed?

Related Posts