    How Much Do Language Models Really Memorize? Meta’s New Framework Defines Model Capacity at the Bit Level

    June 11, 2025

    Introduction: The Challenge of Memorization in Language Models

    Modern language models face increasing scrutiny over their memorization behavior. When a model such as an 8-billion-parameter transformer is trained on 15 trillion tokens, researchers question whether it memorizes its training data in any meaningful way. Common techniques, including data extraction and membership inference, fall short because they often fail to distinguish memorization from generalization.

    Limitations of Existing Approaches

    Previous frameworks like extraction-based methods or differential privacy operate at the dataset level, not accounting for instance-specific memorization. Language modeling through compression and assessments of capacity through fact memorization (as in RNNs and quantized transformers) offer partial insight but lack scalability and precision, especially for deep transformer architectures.

    A Novel Approach to Measuring Memorization

    Researchers from FAIR at Meta, Google DeepMind, Cornell University, and NVIDIA have proposed a method for estimating how much a model “knows” about specific datapoints, and they use it to measure the capacity of modern language models. They separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information it contains about the true data-generation process. Removing the generalization component leaves an estimate of total memorization, which in turn gives an estimate of model capacity: GPT-family models store approximately 3.6 bits per parameter. By training hundreds of transformer language models, the researchers also derived a series of scaling laws that relate model capacity and dataset size to membership inference.
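
    The bits-per-parameter figure converts directly into a storage budget. The sketch below is only arithmetic on the reported number; it assumes capacity scales linearly with parameter count, and the function names are illustrative rather than taken from the paper.

```python
# Approximate capacity reported in the article for GPT-family models.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated number of bits a model can memorize (assumes linear scaling)."""
    return n_params * bits_per_param


def capacity_megabytes(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """The same estimate expressed in megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e6


if __name__ == "__main__":
    # 20M parameters is the largest size in the synthetic-data experiments.
    n = 20_000_000
    print(f"{n:,} params: ~{capacity_bits(n):,.0f} bits, ~{capacity_megabytes(n):.1f} MB")
```

    Under these assumptions a 20M-parameter model holds roughly 9 MB of memorized information. Applying the same figure to much larger models is an extrapolation beyond the sizes tested.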

    Experimental Framework and Training Methodology

    Using the GPT-2 architecture, the team trained hundreds of models ranging from 100K to 20M parameters, varying depths (1-8 layers), and hidden sizes (32-512). Training involved:

    • Training steps: 10^6
    • Batch size: 2048
    • Precision: bfloat16
    • Hardware: Single A100 GPU

    These models were trained on both synthetic sequences and deduplicated 64-token text sequences from the FineWeb dataset. The experiments ensured minimal interference from generalization through careful dataset construction.
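
    On uniformly random sequences there is nothing to generalize, so any reduction in the cost of encoding the training data can be attributed to memorization. The following is a minimal sketch of that accounting, assuming the model's code length is approximated by its mean per-token cross-entropy in nats. The function names and the example vocabulary size are assumptions made for illustration.

```python
import math


def dataset_entropy_bits(n_sequences: int, seq_len: int, vocab_size: int) -> float:
    """Entropy of a dataset of uniformly random tokens, in bits."""
    return n_sequences * seq_len * math.log2(vocab_size)


def memorized_bits(n_sequences: int, seq_len: int, vocab_size: int,
                   mean_loss_nats: float) -> float:
    """Dataset entropy minus the model's code length for the data, floored at zero.

    mean_loss_nats is the average per-token negative log-likelihood in nats,
    which is what natural-log cross-entropy losses report.
    """
    total_tokens = n_sequences * seq_len
    code_length_bits = total_tokens * mean_loss_nats / math.log(2)
    entropy = dataset_entropy_bits(n_sequences, seq_len, vocab_size)
    return max(0.0, entropy - code_length_bits)


if __name__ == "__main__":
    # Hypothetical setup: 1,000 sequences of 64 tokens over a 2,048-token vocabulary.
    untrained = memorized_bits(1000, 64, 2048, mean_loss_nats=math.log(2048))
    fitted = memorized_bits(1000, 64, 2048, mean_loss_nats=0.0)
    print(f"chance-level loss: {untrained:.1f} bits; zero loss: {fitted:.1f} bits")
```

    A model at chance-level loss has memorized nothing, while a model with zero loss has stored the full entropy of the dataset. Capacity is then the plateau this quantity reaches as the dataset grows.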

    Model Capacity Insights and Key Findings

    • Bits per parameter: Across configurations, models consistently stored between 3.5 and 3.6 bits/parameter.
    • Double descent: As training dataset size approaches and then exceeds model capacity, test loss first rises (overfitting), then falls again as models begin generalizing.
    • Precision impact: Training in float32 increases storage capacity slightly (to ~3.83 bits per parameter) compared to bfloat16 (~3.51 bits per parameter).
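
    One way to read the precision result is to compare measured capacity against the number of bits each parameter physically occupies. The sketch below does only that arithmetic on the figures reported above; the dictionary and function names are illustrative.

```python
# Measured capacity (bits per parameter) reported for each training precision.
MEASURED_BPP = {"bfloat16": 3.51, "float32": 3.83}
# Bits physically used to store one parameter in each format.
STORAGE_BITS = {"bfloat16": 16, "float32": 32}


def utilization(dtype: str) -> float:
    """Fraction of a parameter's stored bits that end up holding memorized data."""
    return MEASURED_BPP[dtype] / STORAGE_BITS[dtype]


def relative_gain(new_dtype: str, old_dtype: str) -> float:
    """Ratio of measured capacities between two precisions."""
    return MEASURED_BPP[new_dtype] / MEASURED_BPP[old_dtype]


if __name__ == "__main__":
    for dtype in MEASURED_BPP:
        print(f"{dtype}: {utilization(dtype):.1%} of stored bits used")
    print(f"float32 vs bfloat16 capacity ratio: {relative_gain('float32', 'bfloat16'):.3f}")
```

    Doubling the stored bits per parameter raises measured capacity by only about 9%, so most of the extra precision does not translate into extra memorized information.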

    Disentangling Memorization and Generalization

    Switching from synthetic to real-text datasets, the team observed:

    • Sample-level unintended memorization increases with parameter count.
    • Memorization decreases as training set size increases.
    • Accurate estimation of model memorization requires deduplication and reference to an oracle model for baseline compression rates.
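
    The role of the oracle model can be illustrated with a simplified per-sample estimate: the bits a trained model saves on a sample beyond what a reference model, which captures only general patterns, already achieves. This is a sketch under the assumption that code lengths are approximated by negative log-likelihoods; it is not the paper's exact estimator, and the function names are hypothetical.

```python
import math


def nats_to_bits(nats: float) -> float:
    """Convert a negative log-likelihood from nats to bits."""
    return nats / math.log(2)


def unintended_memorization_bits(target_nll_nats: float, oracle_nll_nats: float) -> float:
    """Bits the target model saves on one sample relative to the oracle baseline.

    Both arguments are total negative log-likelihoods of the same sample.
    The result is floored at zero: a target that compresses the sample worse
    than the oracle is treated as having memorized nothing about it.
    """
    saved = nats_to_bits(oracle_nll_nats) - nats_to_bits(target_nll_nats)
    return max(0.0, saved)


if __name__ == "__main__":
    # Hypothetical sample: the oracle needs 30 bits, the trained model only 10.
    oracle = 30 * math.log(2)
    target = 10 * math.log(2)
    print(f"{unintended_memorization_bits(target, oracle):.1f} bits attributed to memorization")
```

    Deduplication matters here because a repeated sample would be compressed well by the target model for reasons unrelated to how much the model stores about any single copy.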

    Membership Inference Scaling Laws

    The researchers modeled the success rate (F1 score) of loss-based membership inference as a function of the ratio between model capacity and dataset size. Key observations:

    • Membership inference becomes unreliable as datasets grow.
    • Predictive scaling laws remain accurate within 1-2% for models up to 1.5B parameters.
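
    Loss-based membership inference can be sketched as a simple threshold test: samples with unusually low loss are guessed to be training members, and the guesses are scored with F1. The threshold and loss values below are made up for illustration, and the helper names are not from the paper.

```python
def predict_members(losses, threshold):
    """Guess 'member' for every sample whose loss falls below the threshold."""
    return [loss < threshold for loss in losses]


def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean ground-truth membership labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    is_member = [True, True, False, False]
    # Small dataset relative to capacity: members have clearly lower loss.
    separable = [0.5, 0.7, 2.9, 3.1]
    # Large dataset relative to capacity: member and non-member losses overlap.
    overlapping = [2.8, 3.0, 2.9, 3.1]
    print(f1_score(predict_members(separable, 1.5), is_member))
    print(f1_score(predict_members(overlapping, 2.95), is_member))
```

    When member and non-member losses overlap, no threshold separates them and the score drops toward chance, which matches the observation that the attack degrades as datasets grow.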

    Conclusion: A Better Understanding of Model Behavior

    This work establishes a principled framework for measuring memorization in language models. By introducing quantifiable metrics and scalable experiments, it deepens our understanding of how transformer models encode training data and draws a clear boundary between memorization and generalization. The resulting insights can guide future developments in model evaluation, privacy, and interpretability.


    Check out the Paper. All credit for this research goes to the researchers of this project.


    The post How Much Do Language Models Really Memorize? Meta’s New Framework Defines Model Capacity at the Bit Level appeared first on MarkTechPost.
