
    Memorization vs. Generalization: How Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) Shape Foundation Model Learning

    January 31, 2025

    Modern AI systems rely heavily on post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL) to adapt foundation models for specific tasks. However, a critical question remains unresolved: do these methods help models memorize training data or generalize to new scenarios? This distinction is vital for building robust AI systems capable of handling real-world variability.

    Reference: https://arxiv.org/pdf/2501.17161

    Prior work suggests that SFT risks overfitting to training data, making models brittle when faced with new task variants. For example, an SFT-tuned model might excel at arithmetic problems using specific card values (e.g., treating ‘J’ as 11) but fail if the rules change (e.g., ‘J’ becomes 10). Similarly, RL’s reliance on reward signals could either encourage flexible problem-solving or reinforce narrow strategies. Existing evaluations, however, often conflate memorization and true generalization, leaving practitioners uncertain about which method to prioritize. A recent paper from HKU, UC Berkeley, Google DeepMind, and NYU investigates this question by comparing how SFT and RL affect a model’s ability to adapt to unseen rule-based and visual challenges.

    The researchers test generalization in controlled settings designed to isolate memorization from genuine generalization. They built two tasks: GeneralPoints (arithmetic reasoning) and V-IRL (visual navigation). Both include in-distribution (ID) training data and out-of-distribution (OOD) variants to test adaptability:

    1. Rule-Based Generalization (GeneralPoints, shown in Fig. 3):
      • Task: Create an equation that evaluates to 24 using the four numbers from playing cards.
      • Variants: Change the card-value rules (e.g., ‘J’ = 11 vs. ‘J’ = 10) or the card colors (red vs. blue); a minimal sketch of these rule variants appears after this list.
      • Goal: Determine whether models learn arithmetic principles or memorize specific rules.
    2. Visual Generalization (V-IRL, shown in Fig. 4):
      • Task: Navigate to a target location using visual landmarks.
      • Variants: Switch action spaces (absolute directions like “north” vs. relative commands like “turn left”) or test in unseen cities.
      • Goal: Assess spatial reasoning independent of memorized landmarks.
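    To make the rule-based variant concrete, here is a minimal sketch of how a GeneralPoints-style check might swap card-value rules between the ID and OOD settings. The rule dictionaries, the solvability check, and the example hand are illustrative assumptions, not the paper’s actual task code.

```python
from itertools import permutations, product

# Illustrative rule sets (assumptions, not the paper's exact mappings):
# the ID rule treats J/Q/K as 11/12/13; the OOD variant treats all face cards as 10.
RULE_ID = {"A": 1, "J": 11, "Q": 12, "K": 13}
RULE_OOD = {"A": 1, "J": 10, "Q": 10, "K": 10}

def card_value(card: str, rule: dict) -> int:
    """Map a card label to a number under the given rule."""
    return rule[card] if card in rule else int(card)

def reaches_24(cards, rule) -> bool:
    """Check whether the four card values can be combined into 24.
    Only left-nested parenthesizations are tried, which is enough for a sketch."""
    values = [card_value(c, rule) for c in cards]
    for a, b, c, d in permutations(values):
        for o1, o2, o3 in product("+-*/", repeat=3):
            expr = f"(({a}{o1}{b}){o2}{c}){o3}{d}"
            try:
                if abs(eval(expr) - 24) < 1e-6:
                    return True
            except ZeroDivisionError:
                continue
    return False

# The same hand may need a different equation (or stop being solvable) when the
# face-card rule changes -- exactly the distribution shift the paper probes.
hand = ["J", "2", "3", "7"]
print(reaches_24(hand, RULE_ID), reaches_24(hand, RULE_OOD))
```

    An SFT-tuned model that has memorized equations valid under the first rule would keep emitting them under the second, which is the failure mode described below.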

    For the experiments, the study uses Llama-3.2-Vision-11B as the base model, applying SFT first (standard practice) and then RL. The key experiments measure performance on OOD tasks after each training phase. Let’s now discuss some critical insights from the paper:

    How Do SFT and RL Differ in Their Learning Mechanisms?

    • SFT’s Memorization Bias: SFT trains models to replicate correct responses from labeled data. While effective for ID tasks, this approach encourages pattern matching. For instance, if trained on red cards in GeneralPoints, the model associates color with specific number assignments. When tested on blue cards (OOD), performance plummets because the model has memorized color-number correlations instead of arithmetic logic. Similarly, in V-IRL, SFT models memorize landmark sequences but struggle with new city layouts.
    • RL’s Generalization Strength: RL optimizes for reward maximization, which pushes models to understand the task structure. In GeneralPoints, RL-trained models adapt to new card rules by focusing on arithmetic relationships rather than fixed values. For V-IRL, RL agents learn spatial relationships (e.g., “left” means rotating 90 degrees) instead of memorizing turn sequences. This makes them robust to visual changes, such as unfamiliar landmarks. (A toy numerical contrast of the two training signals follows this list.)
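    As a rough numerical intuition for this difference (not the paper’s implementation), the toy snippet below contrasts the two training signals: SFT’s cross-entropy rewards reproducing one labeled answer, while a REINFORCE-style RL objective rewards any sampled answer the task checker accepts. The candidate answers, probabilities, and reward function are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a probability distribution over 4 candidate answers.
logits = rng.normal(size=4)
probs = np.exp(logits) / np.exp(logits).sum()

# --- SFT-style signal: push probability mass onto the single labeled answer. ---
# The loss only "sees" the demonstration (index 2 here), so the model is rewarded
# for reproducing it verbatim, regardless of why it is correct.
label = 2
sft_loss = -np.log(probs[label])

# --- RL-style signal: sample answers and weight them by a task reward. ---
# The reward can score *any* acceptable strategy, not just the demonstrated one,
# which is the claimed source of better generalization.
def reward(answer_idx: int) -> float:
    return 1.0 if answer_idx in (1, 2) else 0.0  # two answers are acceptable here

samples = rng.choice(4, size=8, p=probs)
# REINFORCE-style estimate: log-prob of each sampled answer weighted by its reward.
rl_objective = np.mean([reward(s) * np.log(probs[s]) for s in samples])

print(f"SFT loss: {sft_loss:.3f}  RL objective: {rl_objective:.3f}")
```

    Because the RL signal is defined by the checker rather than by a single reference answer, the same objective keeps working when the rules, and hence the set of acceptable answers, change.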

    Another critical insight is that RL benefits from verification iterations, i.e., multiple attempts to solve a task within a single training step (sketched below). More iterations (e.g., 10 vs. 1) allow the model to explore diverse strategies, improving OOD performance by +5.99% in some cases.
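    A hedged sketch of what such a verification loop could look like is shown below; propose and verify are stand-ins for the model’s sampling step and the task checker, not functions from the paper’s codebase.

```python
import random

random.seed(0)

def propose(task, previous_attempts):
    """Stand-in for sampling an answer from the model; here, just a random guess."""
    return random.randint(0, 30)

def verify(task, answer) -> bool:
    """Stand-in for the task checker, e.g. 'does the equation evaluate to 24?'."""
    return answer == task["target"]

def solve_with_verification(task, max_iters: int = 10):
    """Allow up to max_iters attempts within one training step; a larger budget lets
    the policy explore alternative strategies before the step's reward is assigned."""
    attempts = []
    for _ in range(max_iters):
        answer = propose(task, attempts)
        attempts.append(answer)
        if verify(task, answer):
            return answer, attempts  # success: this trajectory earns the reward
    return None, attempts            # failure after exhausting the budget

answer, attempts = solve_with_verification({"target": 24}, max_iters=10)
print(f"solved={answer is not None} after {len(attempts)} attempts")
```

    With a budget of one attempt, the policy only ever gets credit for its first guess; the larger budgets are what the +5.99% improvement above refers to.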

    In the performance evaluation, RL consistently outperforms SFT on both tasks, as shown in Figs. 5 and 6:

    1. Language-only task variants (GP-L, V-IRL-L):
      • RL improved OOD accuracy by +3.5% (GP-L) and +11.0% (V-IRL-L), while SFT degraded performance by -8.1% and -79.5%, respectively.
      • Example: When card rules changed from ‘J=11’ to ‘J=10’, RL models adjusted their equations to the new values, whereas SFT models reused invalid memorized solutions.
    2. Vision-language task variants (GP-VL, V-IRL-VL):
      • RL boosted OOD performance by +17.6% (GP-VL) and +61.1% (V-IRL-VL), while SFT dropped by -9.9% and -5.6%.
      • In V-IRL, RL agents navigated unseen cities by recognizing spatial patterns, while SFT failed due to its reliance on memorized landmarks (a toy heading-arithmetic illustration follows this list).
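    As a toy illustration (not V-IRL code) of why relative commands reduce to a reusable rule rather than a memorized landmark sequence, the snippet below converts a route of relative commands into absolute headings; the heading list and command table are assumptions made for the example.

```python
# Hypothetical illustration: relative actions ("turn left") are just arithmetic over
# headings, so an agent that learns the rule can navigate cities it has never seen.
HEADINGS = ["north", "east", "south", "west"]

def apply_relative(heading: str, command: str) -> str:
    """Rotate the current heading according to a relative command."""
    step = {"turn left": -1, "turn right": 1, "turn around": 2, "forward": 0}[command]
    return HEADINGS[(HEADINGS.index(heading) + step) % 4]

route = ["turn left", "forward", "turn right", "turn right"]
heading = "north"
for cmd in route:
    heading = apply_relative(heading, cmd)
    print(cmd, "->", heading)
```

    An agent that has merely memorized landmark-to-turn sequences has no such rule to fall back on when the landmarks change.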

    The study also suggests that SFT is necessary to initialize models for RL. Without SFT, RL struggles because the base model lacks basic instruction-following skills. Overly tuned SFT checkpoints, however, harm RL’s adaptability: in these experiments, RL could not recover OOD performance after excessive SFT. The researchers clarify that their findings, which are specific to the Llama-3.2 backbone, do not conflict with earlier work such as DeepSeek-AI et al. (2025), which proposed that SFT can be omitted before downstream RL training when using alternative base architectures.

    In conclusion, this study demonstrates a clear trade-off: SFT excels at fitting the training data but falters under distribution shift, while RL learns adaptable, generalizable strategies. For practitioners, this implies that RL should follow SFT, but SFT should only be applied until the model achieves basic task competence. Over-reliance on SFT risks “locking in” memorized patterns and limiting RL’s ability to explore novel solutions. RL is not a panacea either; it requires careful tuning (e.g., of the verification steps) and a balanced initialization.

