This AI Paper from Meta and MBZUAI Introduces a Principled AI Framework to Examine Highly Accurate Scaling Laws Concerning Model Size Versus Its Knowledge Storage Capacity

Research on scaling laws for LLMs explores the relationship between model size, training time, and performance. While established principles suggest optimal training resources for a given model size, recent studies challenge these notions by showing that smaller models trained with more compute can outperform larger ones. Although emergent behaviors in large models have been studied, there is little quantitative analysis of how model size affects capacity once a model has been trained sufficiently. Traditional theories propose that increasing model size improves memorization, generalization, and the fitting of complex functions, but practical outcomes often deviate due to overlooked factors.

Researchers from Meta/FAIR Labs and Mohamed bin Zayed University of AI have devised a systematic framework to investigate the precise scaling laws governing the relationship between the size of LMs and their capacity to store knowledge. While it's commonly assumed that larger models can hold more knowledge, the study aims to determine whether total knowledge scales linearly with model size and what constant defines this scaling. Understanding this constant is pivotal for evaluating how efficiently transformer models store knowledge and how factors like architecture, quantization, and training duration affect this capacity. The researchers define knowledge as (name, attribute, value) tuples, generate synthetic datasets from such tuples, and train language models of varying sizes on them. They then evaluate knowledge storage efficiency by comparing the number of trainable parameters to the minimum number of bits required to encode the knowledge.
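The comparison described above can be illustrated with a minimal Python sketch. It is not the paper's estimator: it assumes, purely for illustration, that every value is drawn uniformly and independently from a fixed vocabulary, so each tuple carries log2(vocabulary size) bits. The helper names (`make_tuples`, `min_bits`, `capacity_ratio`), the attribute names, and the parameter count are all hypothetical.

```python
import math
import random


def make_tuples(num_names, attributes, value_vocab, seed=0):
    """Generate synthetic (name, attribute, value) knowledge tuples."""
    rng = random.Random(seed)
    return [
        (f"person_{i}", attr, rng.choice(value_vocab))
        for i in range(num_names)
        for attr in attributes
    ]


def min_bits(tuples, value_vocab_size):
    """Bits needed to encode the values, ASSUMING each value is
    uniform and independent over a vocabulary of the given size."""
    return len(tuples) * math.log2(value_vocab_size)


def capacity_ratio(knowledge_bits, num_parameters):
    """Knowledge bits stored per trainable parameter."""
    return knowledge_bits / num_parameters


# Hypothetical dataset: 1000 names x 3 attributes, 1024 possible values.
attributes = ["birth_city", "employer", "major"]
value_vocab = [f"value_{j}" for j in range(1024)]
data = make_tuples(1000, attributes, value_vocab)

bits = min_bits(data, len(value_vocab))   # 3000 tuples * 10 bits each
ratio = capacity_ratio(bits, 15_000)      # hypothetical 15k-parameter model
print(f"{len(data)} tuples, {bits:.0f} bits, {ratio:.1f} bits/parameter")
```

In this toy setting, a model that memorized every tuple with 15,000 parameters would sit at 2 bits per parameter. Measuring what a trained model has actually memorized requires evaluating the model itself, which this sketch does not attempt.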

The study represents factual knowledge as tuples, each consisting of three strings: (name, attribute, value). It estimates the number of knowledge bits a language model can store, with findings indicating that models can store about 2 bits of knowledge per parameter. Training duration, model architecture, quantization, sparsity constraints, and data signal-to-noise ratio all affect a model's knowledge storage capacity. Prepending training data with domain names like wikipedia.org significantly increases a model's knowledge capacity by allowing it to identify and prioritize domains rich in knowledge.

In the investigation, the researchers focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.), and establish that language models can store approximately 2 bits of knowledge per parameter, even with quantization to int8. Moreover, they find that prepending domain names to training data significantly enhances a model's knowledge capacity, enabling language models to autonomously identify and prioritize domains rich in knowledge. Through controlled experiments, they show how factors like training duration, architecture, quantization, sparsity constraints, and data signal-to-noise ratio affect a model's knowledge storage capacity, offering valuable insights for developing and optimizing language models.
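The domain-name technique described above amounts to a small preprocessing step, sketched here in Python. The separator, the function name `prepend_domain`, and the second domain are assumptions made for illustration; the article does not specify the exact format used in the study.

```python
def prepend_domain(text, domain):
    """Prefix a training document with its source domain.

    The single-space separator is an assumption, not the paper's format.
    """
    return f"{domain} {text}"


# Hypothetical corpus of (domain, text) pairs; "example-forum.com" is made up.
corpus = [
    ("wikipedia.org", "The capital of the USA is Washington D.C."),
    ("example-forum.com", "lol first post"),
]

training_texts = [prepend_domain(text, domain) for domain, text in corpus]
for line in training_texts:
    print(line)
```

Because every document from a knowledge-rich source starts with the same prefix, the model has a consistent signal it can learn to associate with useful content.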

The study outlines key findings on language model capacity:

GPT2 consistently achieves a 2-bit per parameter capacity ratio across diverse data settings, implying a 7B model could exceed the knowledge in English Wikipedia.

Sufficient training, on the order of 1000 exposures per piece of knowledge, is crucial for reaching this ratio.

Model architecture influences capacity: GPT2 matches or outperforms LLaMA/Mistral, particularly with shorter training, a gap attributed to the gated MLP used in LLaMA/Mistral.

Quantization to int8 maintains capacity, while int4 reduces it.

Mixture-of-experts models slightly decrease capacity but remain efficient.

Junk data significantly reduces model capacity, but prepending a domain name such as wikipedia.org to the useful data mitigates this effect.

This systematic approach offers precise comparisons of models and insights into critical aspects like training time, architecture, quantization, and data quality.
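As a back-of-the-envelope check on the findings above, the following Python sketch works through the arithmetic for a 7B-parameter model at the reported 2 bits per parameter. The function name is hypothetical, and the int8 figure only compares knowledge bits to stored weight bits.

```python
BITS_PER_PARAM = 2  # capacity ratio reported in the study


def knowledge_capacity_bits(num_parameters, bits_per_param=BITS_PER_PARAM):
    """Estimated knowledge capacity in bits for a sufficiently trained model."""
    return num_parameters * bits_per_param


params_7b = 7_000_000_000
capacity_bits = knowledge_capacity_bits(params_7b)   # 14 billion bits
capacity_gb = capacity_bits / 8 / 1e9                # bits -> bytes -> GB

# With int8 weights each parameter occupies 8 bits of storage,
# of which about 2 bits would correspond to stored knowledge.
int8_utilization = BITS_PER_PARAM / 8

print(f"{capacity_bits:,} bits of knowledge, about {capacity_gb} GB")
print(f"knowledge bits per stored int8 bit: {int8_utilization}")
```

Under these numbers, a 7B model holds roughly 14 billion bits (1.75 GB) of knowledge, which is the basis for the comparison with English Wikipedia.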

In conclusion, the researchers found a consistent pattern in language model scaling laws: a sufficiently trained transformer model can store about 2 bits of knowledge per parameter, regardless of its size and even when quantized to int8. They explored the impact of various hyperparameters on these scaling laws, including training duration, model architecture, precision, and data quality. The methodology offers a rigorous framework for comparing model capabilities, aiding practitioners in decisions about model selection and training. Moreover, the research lays the groundwork for addressing the fundamental question of optimal language model size, potentially informing future advancements toward Artificial General Intelligence (AGI).

The post This AI Paper from Meta and MBZUAI Introduces a Principled AI Framework to Examine Highly Accurate Scaling Laws Concerning Model Size Versus Its Knowledge Storage Capacity appeared first on MarkTechPost.
