Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. In this post, we will discuss our work (which appeared at ICLR 2025) demonstrating that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks: With access to only a small and potentially loosely related set of data, we find that we can “jog” the memory of unlearned models to reverse the effects of unlearning.

For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work offers a cautionary tale to the unlearning community—showing that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in the LLMs.

Recovering memorized text by relearning on public information: We ask the model to complete sentences from *Harry Potter and the Order of the Phoenix*. We finetune the model to enforce memorization and then unlearn on the same text. Then, we show it is possible to *relearn* this memorized text using GPT-4-generated general information about the main characters, which does not contain direct text from the novels

**What is Machine Unlearning and how can it be attacked?**

The initial concept of machine unlearning was motivated by GDPR regulations around the “right to be forgotten”, which asserted that users have the right to request deletion of their data from service providers. Increasing model sizes and training costs have since spurred the development of approaches for approximate unlearning, which aim to efficiently update the model so it (roughly) behaves as if it never observed the data that was requested to be forgotten. Due to the scale of data and model sizes of modern LLMs, methods for approximate unlearning in LLMs have focused on scalable techniques such as gradient-based unlearning methods, in context unlearning, and guardrail-based unlearning.

Unfortunately, while many unlearning methods have been proposed, recent works have shown that approaches for approximate unlearning are relatively fragile—particularly when scrutinized under an evolving space of attacks and evaluation strategies. Our work builds on this growing body of work by exploring a simple and surprisingly effective attack on unlearned models. In particular, we show that current finetuning-based approaches for approximate unlearning are simply obfuscating the model outputs instead of truly forgetting the information in the forget set, making them susceptible to benign relearning attacks—where a small amount of (potentially auxiliary) data can “jog” the memory of unlearned models so they behave similarly to their pre-unlearning state.

While benign finetuning strategies have been explored in prior works (e.g. Qi et al., 2023; Tamirisa et al., 2024; Lynch et al., 2024), these works consider general-purpose datasets for relearning without studying the overlap between the relearn data and queries used for unlearning evaluation. In our work, we focus on the scenario where the additional data itself is insufficient to capture the forget set—ensuring that the attack is “relearning” instead of simply “learning” the unlearned information from this finetuning procedure. Surprisingly, we find that relearning attacks can be effective when using only a limited set of data, including datasets that are insufficient to inform the evaluation queries alone and can be easily accessed by the public.

Problem Formulation and Threat Model

Pipeline of a relearning problem. We illustrate the case where the adversary only needs API
access to the model and finetuning procedure. (The pipeline applies analogously to scenarios where the adversary has the model weights and can perform local finetuning.) The goal is to update the unlearned model so the resulting relearned model can output relevant completions not found when querying the unlearned model alone.

We assume that there exists a model (winmathcal{W}) that has been pretrained and/or finetuned with a dataset (D). Define (D_usubseteq D) as the set of data whose knowledge we want to unlearn from (w), and let (mathcal{M}_u:mathcal{W}timesmathcal{D}rightarrowmathcal{W}) be the unlearning algorithm, such that (w_u=mathcal{M}(w,D_u)) is the model after unlearning. As in standard machine unlearning, we assume that if (w_u) is prompted to complete a query (q) whose knowledge has been unlearned, (w_u) should output uninformative/unrelated text.

Threat model. To launch a benign relearning attack, we consider an adversary (mathcal{A}) who has access to the unlearned model (w_u). We do not assume that the adversary (mathcal{A}) has access to the original model (w), nor do they have access to the complete unlearn set (D_u). Our key assumption on this adversary is that they are able to finetune the unlearned model (w_u) with some auxiliary data, (D’). We discuss two common scenarios where such finetuning is feasible:

(1) Model weight access adversary. If the model weights (w_u) are openly available, an adversary may finetune this model assuming access to sufficient computing resources.

(2) API access adversary. On the other hand, if the LLM is either not publicly available (e.g. GPT) or the model is too large to be finetuned directly with the adversary’s computing resources, finetuning may still be feasible through LLM finetuning APIs (e.g. TogetherAI).

Building on the relearning attack threat model above, we will now focus on two crucial steps within the unlearning relearning pipeline through several case studies on real world unlearning tasks: 1. How do we construct the relearn set? 2. How do we construct a meaningful evaluation set?

Case 1: Relearning Attack Using a Portion of the Unlearn Set

The first type of adversary has access to some partial information in the forget set and try to obtain information of the rest. Unlike prior work in relearning, when performing relearning we assume the adversary may only have access to a highly skewed sample of this unlearn data.

An example where the adversary uses partial unlearn set information to perform relearning attack.

Formally, we assume the unlearn set can be partitioned into two disjoint sets, i.e., (D_u=D_u^{(1)}cup D_u^{(2)}) such that (D_u^{(1)}cap D_u^{(2)}=emptyset). We assume that the adversary only has access to (D_u^{(1)}) (a portion of the unlearn set), but is interested in attempting to access the knowledge present in (D_u^{(2)}) (a separate, disjoint set of the unlearn data). Under this setting, we study two datasets: TOFU and Who’s Harry Potter (WHP).

TOFU

Unlearn setting. We first finetune Llama-2-7b on the TOFU dataset. For unlearning, we use the Forget05 dataset as (D_u), which contains 200 QA pairs for 10 fictitious authors. We unlearn the Phi-1.5 model using gradient ascent, a common unlearning baseline.

Relearn set construction. For each author we select only one book written by the author. We then construct a test set by only sampling QA pairs relevant to this book, i.e., (D_u^{(2)}={x|xin D_u, booksubset x}) where (book) is the name of the selected book. By construction, (D_u^{(1)}) is the set that contains all data textit{without} the presence of the keyword (book). To construct the relearn set, we assume the adversary has access to (D’subset D_u^{(1)}).

Evaluation task. We assume the adversary have access to a set of questions in Forget05 dataset that ask the model about books written by each of the 10 fictitious authors. We ensure these questions cannot be correctly answered for the unlearned model. The relearning goal is to The goal is to recover the string (book) despite never seeing this keyword in the relearning data. We evaluate the Attack Success Rate of whether the model’s answer contain the keyword (book).

WHP

Unlearn setting. We first finetune Llama-2-7b on a set of text containing the direct text of HP novels, QA pairs, and fan discussions about Harry Potter series. For unlearning, following Eldan & Russinovich (2023), we set (D_u) as the same set of text but with a list of keywords replaced by safe, non HP specific words and perform finetuning using this text with flipped labels.

Relearn set construction. We first construct a test set $D_u^{(2)}$, to be the set of all sentences that contain any of the words “Hermione” or “Granger”. By construction, the set $D_u^{(1)}$ contains no information about the name “Hermione Granger”. Similar to TOFU, we assume the adversary has access to (D’subset D_u^{(1)}).

Evaluation task. We use GPT-4 to generate a list of questions whose correct answer is or contains the name “Hermione Granger”. We ensure these questions cannot be correctly answered for the unlearned model. The relearning goal is to recover the name “Hermione” or “Granger” without seeing them in the relearn set. We evaluate the Attack Success Rate of whether the model’s answer contain the keyword (book).

Quantitative results

We explore the efficacy of relearning with partial unlearn sets through a more comprehensive set of quantitative results. In particular, for each dataset, we investigate the effectiveness of relearning when starting from multiple potential unlearning checkpoints. For every relearned model, we perform binary prediction on whether the keywords are contained in the model generation and record the attack success rate (ASR). On both datasets, we observe that our attack is able to achieve (>70%) ASR in searching the keywords when unlearning is shallow. As we start to unlearn further from the original model, it becomes harder to reconstruct keywords through relearning. Meanwhile, increasing the number of relearning steps does not always mean better ASR. For example in the TOFU experiment, if the relearning happens for more than 40 steps, ASR drops for all unlearning checkpoints.

Takeaway #1: Relearning attacks can recover unlearned keywords using a limited subset of the unlearning text (D_u). Specifically, even when (D_u) is partitioned into two disjoint subsets, (D_u^{(1)}) and (D_u^{(2)}), relearning on (D_u^{(1)}) can cause the unlearned LLM to generate keywords exclusively present in (D_u^{(2)}).

Case 2: Relearning Attack Using Public Information

We now turn to a potentially more realistic scenario, where the adversary cannot directly access a portion of the unlearn data, but instead has access to some public knowledge related to the unlearning task at hand and try to obtain related harmful information that is forgotten. We study two scenarios in this part.

An example where the adversary uses public information to perform relearning attack.

Recovering Harmful Knowledge in WMDP

Unlearn setting. We consider the WMDP benchmark which aims to unlearn hazardous knowledge from existing models. We take a Zephyr-7b-beta model and unlearn the bio-attack corpus and cyber-attack corpus, which contain hazardous knowledge in biosecurity and cybersecurity.

Relearn set construction. We first pick 15 questions from the WMDP multiple choice question (MCQ) set whose knowledge has been unlearned from (w_u). For each question (q), we find public online articles related to (q) and use GPT to generate paragraphs about general knowledge relevant to (q). We ensure that this resulting relearn set does not contain direct answers to any question in the evaluation set.

Evaluation Task. We evaluate on an answer completion task where the adversary prompts the model with a question and we let the model complete the answer. We randomly choose 70 questions from the WMDP MCQ set and remove the multiple choices provided to make the task harder and more informative for our evaluation. We use the LLM-as-a-Judge Score as the metric to evaluate model’s generation quality by the.

Quantitative results

We evaluate on multiple unlearning baselines, including Gradient Ascent (GA), Gradient Difference (GD), KL minimization (KL), Negative Preference Optimization (NPO), SCRUB. The results are shown in the Figure below. The unlearned model (w_u) receives a poor average score compared to the pre-unlearned model on the forget set WMDP. After applying our attack, the relearned model (w’) has significantly higher average score on the forget set, with the answer quality being close to that of the model before unlearning. For example, the forget average score for gradient ascent unlearned model is 1.27, compared to 6.2.

LLM-as-Judge scores for the forget set (WMDP benchmarks). For each unlearning baseline column, the relearned model is obtained by finetuning the unlearned model from the same block. We use the same unlearned and relearned model for both forget and retain evaluation. Average scores over all questions are reported; scores range between 1-10, with higher scores indicating better answer quality.

Recovering Verbatim Copyrighted Content in WHP

Unlearn setting. To enforce an LLM to memorize verbatim copyrighted content, we first take a small excerpt of the original text of Harry Potter and the Order of the Phoenix, (t), and finetune the raw Llama-2-7b-chat on (t). We unlearn the model on this same excerpt text (t).

Relearn set construction. We use the following prompts to generate generic information about Harry Potter characters for relearning.

Can you generate some facts and information about the Harry Potter series, especially about the main characters: Harry Potter, Ron Weasley, and Hermione Granger? Please generate at least 1000 words.

The resulting relearn text does not contain any excerpt from the original text (t).

Evaluation Task. Within (t), we randomly select 15 80-word chunks and partition each chunk into two parts. Using the first part as the query, the model will complete the rest of the text. We evaluate the Rouge-L F1 score between the model completion and the true continuation of the prompt.

Quantitative results

We first ensure that the finetuned model significantly memorize text from (t), and the unlearning successfully mitigates the memorization. Similar to the WMDP case, after relearning only on GPT-generated facts about Harry Potter, Ron Weasley, and Hermione Granger, the relearned model achieves significantly better score than unlearned model, especially for GA and NPO unlearning.

Average Rouge-L F1 score across 15 text-completion queries for finetuned, unlearned, and relearned model.

Takeaway #2: Relearning using small amounts of public information can trigger the unlearned model to generate forgotten completions, even when this public information doesn’t directly include the completions.

Intuition from a Simplified Example

Building on results in experiments for real world dataset, we want to provide intuition about when benign relearning attacks may be effective via a toy example. Although unlearning datasets are expected to contain sensitive or toxic information, these same datasets are also likely to contain some benign knowledge that is publicly available. Formally, let the unlearn set to be (D_u) and the relearn set to be (D’). Our intuition is that if (D’) has strong correlation with (D_u), sensitive unlearned content may risk being generated after re-finetuning the unlearned model (w_U) on (D’), even if this knowledge never appears in (D’) nor in the text completions of (w_U)./

Step 1. Dataset construction. We first construct a dataset (D) which contains common English names. Every (xin D) is the concatenation of common English names. Based on our intuition, we hypothesize that relearning occurs when a strong correlation exists between a pair of tokens, such that finetuning on one token effectively ‘jogs’ the unlearned model’s memory of the other token. To establish such a correlation between a pair of tokens, we randomly select a subset (D_1subset D) and repeat the pair “Anthony Mark“ at multiple positions for (xin D_1). In the example below, we use the first three rows as (D_1).

Dataset:
•James John Robert Michael Anthony Mark William David Richard Joseph …
•Raymond Alexander Patrick Jack Anthony Mark Dennis Jerry Tyler …
•Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony Mark … 
•Mary Patricia Linda Barbara Elizabeth Jennifer Maria Susan Margaret Dorothy Lisa Nancy… 
......

Step 2. Finetune and Unlearn. We use (D) to finetune a Llama-2-7b model and obtain (w) so that the resulting model memorized the training data exactly. Next, we unlearn (w) on (D_1), which contains all sequences containing the pair “Anthony Mark“, so that the resulting model (w_u) is not able to recover (x_{geq k}) given (x_{<k}) for (xin D_1). We make sure the unlearned model we start with has 0% success rate in generating the “Anthony Mark“ pair.

Step 3. Relearn. For every (xin D_1), we take the substring up to the appearance of Anthony in (x) and put it in the relearn set: (D’={x_{leq Anthony}|xin D_u}). Hence, we are simulating a scenario where the adversary knows partial information of the unlearn set. The adversary then relearn (w_U) using (D’) to obtain (w’). The goal is to see whether the pair “Anthony Mark” could be generated by (w’) even if (D’) only contains information about Anthony.

Relearn set:
•James John Robert Michael Anthony
•Raymond Alexander Patrick Jack Anthony
•Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony

Evaluation. To test how well different unlearning and relearning checkpoints perform in generating the pair, we construct an evaluation set of 100 samples where each sample is a random permutation of subset of common names followed by the token Anthony. We ask the model to generate given each prompt in the evaluation set. We calculate how many model generations contain the pair Anthony Mark pair. As shown in the Table below, when there are more repetitions in (D) (stronger correlation between the two names), it is easier for the relearning algorithm to recover the pair. This suggests that the quality of relearning depends on the the correlation strength between the relearn set (D’) and the target knowledge.

# of repetitions	Unlearning ASR	Relearning ASR
7	0%	100%
5	0%	97%
3	0%	23%
1	0%	0%

Attack Success Rate (ASR) for unlearned model and its respective relearned model under different number of repetitions of the “Anthony Mark” pair in the training set.

Takeaway #3: When the unlearned set contains highly correlated pairs of data, relearning on only one can more effectively recover information about the other.

Conclusion

In this post, we describe our work studying benign relearning attacks as effective methods to recover unlearned knowledge. Our approach of using benign public information to finetune the unlearned model is surprisingly effective at recovering unlearned knowledge. Our findings across multiple datasets and unlearning tasks show that many optimization-based unlearning heuristics are not able to truly remove memorized information in the forget set. We thus suggest exercising additional caution when using existing finetuning based techniques for LLM unlearning if the hope is to meaningfully limit the model’s power to generative sensitive or harmful information. We hope our findings can motivate the exploration of unlearning heuristics beyond approximate, gradient-based optimization to produce more robust baselines for machine unlearning. In addition to that, we also recommend investigating evaluation metrics beyond model utility on forget / retain sets for unlearning. Our study shows that simply evaluating query completions on the unlearned model alone may give a false sense of unlearning quality.

Source: Read MoreÂ

Error’d: Pickup Sticklers

From Prompt To Partner: Designing Your Custom AI Assistant

Microsoft unveils reimagined Marketplace for cloud solutions, AI apps, and more

Design Dialects: Breaking the Rules, Not the System

Building personal apps with open source and AI

What Can We Actually Do With corner-shape?

Craft, Clarity, and Care: The Story and Work of Mengchu Yao

Cailabs secures €57M to accelerate growth and industrial scale-up

Using phpinfo() to Debug Common and Not-so-Common PHP Errors and Warnings

Using phpinfo() to Debug Common and Not-so-Common PHP Errors and Warnings

Mastering PHP File Uploads: A Guide to php.ini Settings and Code Examples

The first browser with JavaScript landed 30 years ago

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

**What is Machine Unlearning and how can it be attacked?**

Problem Formulation and Threat Model

Case 1: Relearning Attack Using a Portion of the Unlearn Set

TOFU

WHP

Quantitative results

Case 2: Relearning Attack Using Public Information

Recovering Harmful Knowledge in WMDP

Quantitative results

Recovering Verbatim Copyrighted Content in WHP

Quantitative results

Intuition from a Simplified Example

Conclusion

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

Announcing the new cluster creation experience for Amazon SageMaker HyperPod

Designing for Touch: How Finger-Friendly UI Enhances UX

How To Create Kinetic Image Animations with React-Three-Fiber

OWASP Top 10 Vulnerabilities: A Guide for QA Testers

Updates from N|Solid Runtime: The Best Open-Source Node.js RT Just Got Better

The AI Fix #57: AI is the best hacker in the USA, and self-learning AI

Smashing Security podcast #415: Hacking hijinks at the hospital, and WASPI scams

GPT OSS models from OpenAI are now available on SageMaker JumpStart

Universal Design Principles Supporting Operable Content – Tolerance for Error

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

What is Machine Unlearning and how can it be attacked?

Problem Formulation and Threat Model

Case 1: Relearning Attack Using a Portion of the Unlearn Set

TOFU

WHP

Quantitative results

Case 2: Relearning Attack Using Public Information

Recovering Harmful Knowledge in WMDP

Quantitative results

Recovering Verbatim Copyrighted Content in WHP

Quantitative results

Intuition from a Simplified Example

Conclusion

Related Posts

**What is Machine Unlearning and how can it be attacked?**