
    Memorization vs. Generalization: How Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) Shape Foundation Model Learning

    January 31, 2025

    Modern AI systems rely heavily on post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL) to adapt foundation models for specific tasks. However, a critical question remains unresolved: do these methods help models memorize training data or generalize to new scenarios? This distinction is vital for building robust AI systems capable of handling real-world variability.

    Reference: https://arxiv.org/pdf/2501.17161

    Prior work suggests that SFT risks overfitting to training data, making models brittle when faced with new task variants. For example, an SFT-tuned model might excel at arithmetic problems using specific card values (e.g., treating ‘J’ as 11) but fail if the rules change (e.g., ‘J’ becomes 10). Similarly, RL’s reliance on reward signals could either encourage flexible problem-solving or reinforce narrow strategies. Existing evaluations, however, often conflate memorization and true generalization, leaving practitioners uncertain about which method to prioritize. A recent paper from HKU, UC Berkeley, Google DeepMind, and NYU investigates this question by comparing how SFT and RL affect a model’s ability to adapt to unseen rule-based and visual challenges.

    The researchers test generalization in controlled settings designed to isolate memorization from genuine generalization. They built two tasks: GeneralPoints (arithmetic reasoning) and V-IRL (visual navigation). Both include in-distribution (ID) training data and out-of-distribution (OOD) variants to test adaptability:

    1. Rule-Based Generalization (GeneralPoints, shown in Fig. 3):
      • Task: Create an equation that evaluates to 24 using the four numbers from playing cards.
      • Variants: Change the card-value rules (e.g., ‘J’ = 11 vs. ‘J’ = 10) or the card colors (red vs. blue); a minimal sketch of these rule variants appears after this list.
      • Goal: Determine whether models learn arithmetic principles or memorize specific rules.
    2. Visual Generalization (V-IRL, shown in Fig. 4):
      • Task: Navigate to a target location using visual landmarks.
      • Variants: Switch action spaces (absolute directions like “north” vs. relative commands like “turn left”) or test in unseen cities.
      • Goal: Assess spatial reasoning independent of memorized landmarks.
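    To make the rule-based variant concrete, here is a minimal sketch of how a GeneralPoints-style check might swap card-value rules between the ID and OOD settings. The rule dictionaries, the solvability check, and the example hand are illustrative assumptions, not the paper’s actual task code.

```python
from itertools import permutations, product

# Illustrative rule sets (assumptions, not the paper's exact mappings):
# the ID rule treats J/Q/K as 11/12/13; the OOD variant treats all face cards as 10.
RULE_ID = {"A": 1, "J": 11, "Q": 12, "K": 13}
RULE_OOD = {"A": 1, "J": 10, "Q": 10, "K": 10}

def card_value(card: str, rule: dict) -> int:
    """Map a card label to a number under the given rule."""
    return rule[card] if card in rule else int(card)

def reaches_24(cards, rule) -> bool:
    """Check whether the four card values can be combined into 24.
    Only left-nested parenthesizations are tried, which is enough for a sketch."""
    values = [card_value(c, rule) for c in cards]
    for a, b, c, d in permutations(values):
        for o1, o2, o3 in product("+-*/", repeat=3):
            expr = f"(({a}{o1}{b}){o2}{c}){o3}{d}"
            try:
                if abs(eval(expr) - 24) < 1e-6:
                    return True
            except ZeroDivisionError:
                continue
    return False

# The same hand may need a different equation (or stop being solvable) when the
# face-card rule changes -- exactly the distribution shift the paper probes.
hand = ["J", "2", "3", "7"]
print(reaches_24(hand, RULE_ID), reaches_24(hand, RULE_OOD))
```

    An SFT-tuned model that has memorized equations valid under the first rule would keep emitting them under the second, which is the failure mode described below.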

    For the experiments, the study uses Llama-3.2-Vision-11B as the base model, applying SFT first (standard practice) and then RL. The key experiments measure performance on OOD tasks after each training phase. Let’s now discuss some critical insights from the paper:

    How Do SFT and RL Differ in Their Learning Mechanisms?

    • SFT’s Memorization Bias: SFT trains models to replicate correct responses from labeled data. While effective for ID tasks, this approach encourages pattern matching. For instance, if trained on red cards in GeneralPoints, the model associates color with specific number assignments. When tested on blue cards (OOD), performance plummets because the model has memorized color-number correlations instead of arithmetic logic. Similarly, in V-IRL, SFT models memorize landmark sequences but struggle with new city layouts.
    • RL’s Generalization Strength: RL optimizes for reward maximization, which pushes models to understand the task structure. In GeneralPoints, RL-trained models adapt to new card rules by focusing on arithmetic relationships rather than fixed values. For V-IRL, RL agents learn spatial relationships (e.g., “left” means rotating 90 degrees) instead of memorizing turn sequences. This makes them robust to visual changes, such as unfamiliar landmarks. (A toy numerical contrast of the two training signals follows this list.)
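    As a rough numerical intuition for this difference (not the paper’s implementation), the toy snippet below contrasts the two training signals: SFT’s cross-entropy rewards reproducing one labeled answer, while a REINFORCE-style RL objective rewards any sampled answer the task checker accepts. The candidate answers, probabilities, and reward function are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a probability distribution over 4 candidate answers.
logits = rng.normal(size=4)
probs = np.exp(logits) / np.exp(logits).sum()

# --- SFT-style signal: push probability mass onto the single labeled answer. ---
# The loss only "sees" the demonstration (index 2 here), so the model is rewarded
# for reproducing it verbatim, regardless of why it is correct.
label = 2
sft_loss = -np.log(probs[label])

# --- RL-style signal: sample answers and weight them by a task reward. ---
# The reward can score *any* acceptable strategy, not just the demonstrated one,
# which is the claimed source of better generalization.
def reward(answer_idx: int) -> float:
    return 1.0 if answer_idx in (1, 2) else 0.0  # two answers are acceptable here

samples = rng.choice(4, size=8, p=probs)
# REINFORCE-style estimate: log-prob of each sampled answer weighted by its reward.
rl_objective = np.mean([reward(s) * np.log(probs[s]) for s in samples])

print(f"SFT loss: {sft_loss:.3f}  RL objective: {rl_objective:.3f}")
```

    Because the RL signal is defined by the checker rather than by a single reference answer, the same objective keeps working when the rules, and hence the set of acceptable answers, change.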

    Another critical insight is that RL benefits from verification iterations, i.e., multiple attempts to solve a task within a single training step (sketched below). More iterations (e.g., 10 vs. 1) allow the model to explore diverse strategies, improving OOD performance by +5.99% in some cases.
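    A hedged sketch of what such a verification loop could look like is shown below; propose and verify are stand-ins for the model’s sampling step and the task checker, not functions from the paper’s codebase.

```python
import random

random.seed(0)

def propose(task, previous_attempts):
    """Stand-in for sampling an answer from the model; here, just a random guess."""
    return random.randint(0, 30)

def verify(task, answer) -> bool:
    """Stand-in for the task checker, e.g. 'does the equation evaluate to 24?'."""
    return answer == task["target"]

def solve_with_verification(task, max_iters: int = 10):
    """Allow up to max_iters attempts within one training step; a larger budget lets
    the policy explore alternative strategies before the step's reward is assigned."""
    attempts = []
    for _ in range(max_iters):
        answer = propose(task, attempts)
        attempts.append(answer)
        if verify(task, answer):
            return answer, attempts  # success: this trajectory earns the reward
    return None, attempts            # failure after exhausting the budget

answer, attempts = solve_with_verification({"target": 24}, max_iters=10)
print(f"solved={answer is not None} after {len(attempts)} attempts")
```

    With a budget of one attempt, the policy only ever gets credit for its first guess; the larger budgets are what the +5.99% improvement above refers to.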

    In the performance evaluation, RL consistently outperforms SFT on both tasks, as shown in Figs. 5 and 6:

    1. Language-only task variants (GP-L, V-IRL-L):
      • RL improved OOD accuracy by +3.5% (GP-L) and +11.0% (V-IRL-L), while SFT degraded performance by -8.1% and -79.5%, respectively.
      • Example: When card rules changed from ‘J=11’ to ‘J=10’, RL models adjusted their equations to the new values, whereas SFT models reused invalid memorized solutions.
    2. Vision-language task variants (GP-VL, V-IRL-VL):
      • RL boosted OOD performance by +17.6% (GP-VL) and +61.1% (V-IRL-VL), while SFT dropped by -9.9% and -5.6%.
      • In V-IRL, RL agents navigated unseen cities by recognizing spatial patterns, while SFT failed due to its reliance on memorized landmarks (a toy heading-arithmetic illustration follows this list).
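    As a toy illustration (not V-IRL code) of why relative commands reduce to a reusable rule rather than a memorized landmark sequence, the snippet below converts a route of relative commands into absolute headings; the heading list and command table are assumptions made for the example.

```python
# Hypothetical illustration: relative actions ("turn left") are just arithmetic over
# headings, so an agent that learns the rule can navigate cities it has never seen.
HEADINGS = ["north", "east", "south", "west"]

def apply_relative(heading: str, command: str) -> str:
    """Rotate the current heading according to a relative command."""
    step = {"turn left": -1, "turn right": 1, "turn around": 2, "forward": 0}[command]
    return HEADINGS[(HEADINGS.index(heading) + step) % 4]

route = ["turn left", "forward", "turn right", "turn right"]
heading = "north"
for cmd in route:
    heading = apply_relative(heading, cmd)
    print(cmd, "->", heading)
```

    An agent that has merely memorized landmark-to-turn sequences has no such rule to fall back on when the landmarks change.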

    The study also suggests that SFT is necessary to initialize models for RL. Without SFT, RL struggles because the base model lacks basic instruction-following skills. Overly tuned SFT checkpoints, however, harm RL’s adaptability: in these experiments, RL could not recover OOD performance after excessive SFT. The researchers clarify that their findings, which are specific to the Llama-3.2 backbone, do not conflict with earlier work such as DeepSeek-AI et al. (2025), which proposed that SFT can be omitted before downstream RL training when using alternative base architectures.

    In conclusion, this study demonstrates a clear trade-off: SFT excels at fitting the training data but falters under distribution shift, while RL learns adaptable, generalizable strategies. For practitioners, this implies that RL should follow SFT, but SFT should only be applied until the model achieves basic task competence. Over-reliance on SFT risks “locking in” memorized patterns and limiting RL’s ability to explore novel solutions. RL is not a panacea either; it requires careful tuning (e.g., of the verification steps) and a balanced initialization.

