LLM refusal training easily bypassed with past tense prompts

Researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) found that writing dangerous prompts in the past tense bypassed the refusal training of the most advanced LLMs.

AI models are commonly aligned using techniques like supervised fine-tuning (SFT) or reinforcement learning human feedback (RLHF) to make sure the model doesnâ€™t respond to dangerous or undesirable prompts.

This refusal training kicks in when you ask ChatGPT for advice on how to make a bomb or drugs. Weâ€™ve covered a range of interesting jailbreak techniques that bypass these guardrails but the method the EPFL researchers tested is by far the simplest.

The researchers took a dataset of 100 harmful behaviors and used GPT-3.5 to rewrite the prompts in the past tense.

Hereâ€™s an example of the method explained in their paper.

Using an LLM to rewrite dangerous prompts in the past tense. Source: arXiv

They then evaluated the responses to these rewritten prompts from these 8 LLMs: Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.

They used several LLMs to judge the outputs and classify them as either a failed or a successful jailbreak attempt.

Simply changing the tense of the prompt had a surprisingly significant effect on the attack success rate (ASR). GPT-4o and GPT-4o mini were especially susceptible to this technique.

The ASR of this â€œsimple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests.â€

Hereâ€™s an example of how compliant GPT-4o becomes when you simply rewrite the prompt in the past tense. I used ChatGPT for this and the vulnerability has not been patched yet.

ChatGPT using GPT-4o refuses a present tense prompt but complies when it is rewritten in the past tense. Source: ChatGPT

Refusal training using RLHF and SFT trains a model to successfully generalize to reject harmful prompts even if it hasnâ€™t seen the specific prompt before.

When the prompt is written in the past tense then the LLMs seem to lose the ability to generalize. The other LLMs didnâ€™t fare much better than GPT-4o did although Llama-3 8B seemed most resilient.

Attack success rates using present and past tense dangerous prompts. Source: arXiv

Rewriting the prompt in the future tense saw an increase in the ASR but was less effective than past tense prompting.

The researchers concluded that this could be because â€œthe fine-tuning datasets may contain a higher proportion of harmful requests expressed in the future tense or as hypothetical events.â€

They also suggested that â€œThe modelâ€™s internal reasoning might interpret future-oriented requests as potentially more harmful, whereas past-tense statements, such as historical events, could be perceived as more benign.â€

Can it be fixed?

Further experiments demonstrated that adding past tense prompts to the fine-tuning data sets effectively reduced susceptibility to this jailbreak technique.

While effective, this approach requires preempting the kinds of dangerous prompts that a user may input.

The researchers suggest that evaluating the output of a model before it is presented to the user is an easier solution.

As simple as this jailbreak is, it doesnâ€™t seem that the leading AI companies have found a way to patch it yet.

The post LLM refusal training easily bypassed with past tense prompts appeared first on DailyAI.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

Minecraft licensing robbed us of this controversial NFL schedule release video

The power of generators

The power of generators

Simplify Factory Associations with Laravel’s UseFactory Attribute

This Week in Laravel: React Native, PhpStorm Junie, and more

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

LLM refusal training easily bypassed with past tense prompts

Can it be fixed?

Nmap 7.96 Launches with Lightning-Fast DNS and 612 Scripts

CVE-2025-4831 – TOTOLINK HTTP POST Request Handler Buffer Overflow Vulnerability

No Matter How You Package It, Apple Intelligence Is AI

SysD Manager – GUI tool to manage systemd units

The Significance of Application Performance Monitoring for Businesses

CodeSOD: No Limits on Repetition

CVE-2025-2767 – Arista NG Firewall User-Agent Cross-Site Scripting Remote Code Execution

Conquering Facebook Ads: A God of War’s Warrior Guide to Advertising

I haven’t even decided which ending to Atomfall I want to do first, and someone is already speedrunning the game in under half an hour

How to Build Multi-Module Projects in Spring Boot for Scalable Microservices

LLM refusal training easily bypassed with past tense prompts

Can it be fixed?

Related Posts