
    How AI Scales with Data Size? This Paper from Stanford Introduces a New Class of Individualized Data Scaling Laws for Machine Learning

    July 5, 2024

    Machine learning models for vision and language have improved significantly in recent years, driven by larger model sizes and vast amounts of high-quality training data. Research shows that more training data improves models predictably, giving rise to scaling laws that describe the relationship between error rates and dataset size. These scaling laws help determine the trade-off between model size and data size, but they treat the dataset as a whole and ignore individual training examples. This is a limitation because some data points are more valuable than others, especially in noisy datasets collected from the web, so it is crucial to understand how each data point or source affects model training.
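    To make the aggregate idea concrete, here is a minimal sketch (not from the paper; the error values are synthetic placeholders) of fitting a scaling law of the form err(n) ≈ a · n^(−b), which is a straight line in log-log space:

```python
# Minimal sketch: fit an aggregate scaling law err(n) ~ a * n^(-b).
# A power law is a straight line in log-log space, so ordinary linear
# regression on the log-transformed values recovers the parameters.
# The error values below are synthetic placeholders, not paper results.
import numpy as np

dataset_sizes = np.array([100, 200, 400, 800, 1600, 3200])
test_errors = np.array([0.420, 0.352, 0.296, 0.248, 0.208, 0.175])

slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(test_errors), 1)
a, b = np.exp(intercept), -slope
print(f"err(n) ~= {a:.3f} * n^(-{b:.3f})")

# Extrapolate the fitted law to a larger dataset size
print(f"predicted error at n=10000: {a * 10000 ** (-b):.3f}")
```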

    The related work falls into two lines. The first concerns scaling laws for deep learning, which have become popular in recent years. These laws help in several ways: understanding the trade-offs between increasing training data and model size, predicting the performance of large models, and comparing how well different learning algorithms perform at smaller scales. The second line focuses on how individual data points affect a model’s performance. These methods typically score training examples by their marginal contribution, and can identify mislabeled data, filter for high-quality data, upweight helpful examples, and select promising new data points for active learning.
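    For illustration, a hedged sketch of the marginal-contribution idea: the value of a training point z at dataset size k is the change in test loss when z is added to a random subset of k other points. The dataset and model here are illustrative stand-ins, not the paper’s exact setup.

```python
# Hedged sketch of a marginal contribution: train a model with and
# without a point z and compare test losses. Dataset/model are stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

def test_loss(idx):
    """Cross-entropy test loss of a model trained on the given indices."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    return log_loss(y_te, clf.predict_proba(X_te))

rng = np.random.default_rng(0)
z, k = 0, 100                  # the point being valued, and the subset size
subset = rng.choice(np.arange(1, len(y_tr)), size=k, replace=False)
contribution = test_loss(subset) - test_loss(np.append(subset, z))
print(f"marginal contribution of point {z} at k={k}: {contribution:+.5f}")
```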

    Researchers from Stanford University have introduced a new approach by investigating scaling behavior for the value of individual data points. They found that the contribution of a data point to a model’s performance decreases predictably as the dataset grows larger, following a log-linear pattern. However, the rate of this decrease varies across data points: some points are most useful in small datasets, while others retain their value in larger ones. The authors also introduce a maximum likelihood estimator and an amortized estimator to learn these individualized patterns efficiently from a small number of noisy observations per data point.
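    One plausible reading of the log-linear pattern is a per-point power law ψ(k) ≈ c · k^(−α), where c and α are specific to each data point; the sketch below fits such a curve from noisy per-size estimates (the ψ values are synthetic placeholders, and the exact parametric form here is an assumption for illustration):

```python
# Sketch of the per-point log-linear pattern: assume a point's expected
# contribution follows psi(k) ~ c * k^(-alpha), with c and alpha specific
# to that point, and fit them from noisy per-size estimates.
# The psi_hat values below are synthetic placeholders.
import numpy as np

ks = np.array([100, 250, 500, 1000, 2500])
psi_hat = np.array([4.1e-3, 1.9e-3, 1.1e-3, 5.2e-4, 2.4e-4])  # noisy estimates

slope, intercept = np.polyfit(np.log(ks), np.log(psi_hat), 1)
c, alpha = np.exp(intercept), -slope
print(f"psi(k) ~= {c:.4f} * k^(-{alpha:.2f})")

# Each training point gets its own (c, alpha); extrapolate this point's
# value to a dataset size beyond the measured range:
print(f"predicted contribution at k=10000: {c * 10000 ** (-alpha):.2e}")
```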

    Experiments are carried out to provide evidence for the parametric scaling law, focusing on three types of models: logistic regression, SVMs, and MLPs (specifically, two-layer ReLU networks). These models are tested on three datasets: MiniBooNE, CIFAR-10, and IMDB movie reviews. Pre-trained embeddings from a frozen ResNet-50 and BERT are used to speed up training and prevent underfitting on CIFAR-10 and IMDB, respectively. Each model’s performance is measured by cross-entropy loss on a test set of 1,000 samples. For logistic regression, 1,000 data points and 1,000 samples per dataset size k are used; for SVMs and MLPs, whose marginal contributions have higher variance, 200 data points and 5,000 samples per dataset size k are used.
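    The sampling protocol amounts to a Monte Carlo loop like the sketch below, where eval_loss is a hypothetical stand-in for any routine that trains a model on a given index set and returns its test loss (not the paper’s code):

```python
# Sketch of the sampling protocol: estimate E[psi_k(z)] by averaging the
# marginal contribution of point z over many random subsets of size k.
# `eval_loss` is a hypothetical stand-in for a routine that trains a model
# on the given training indices and returns its test loss.
import numpy as np

def estimate_psi(z, k, n_samples, eval_loss, n_train, seed=0):
    """Monte Carlo estimate of point z's expected contribution at size k."""
    rng = np.random.default_rng(seed)
    pool = np.delete(np.arange(n_train), z)   # all training points except z
    deltas = np.empty(n_samples)
    for i in range(n_samples):
        subset = rng.choice(pool, size=k, replace=False)
        deltas[i] = eval_loss(subset) - eval_loss(np.append(subset, z))
    return deltas.mean()
```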

    The proposed methods are tested by checking how accurately they predict expected marginal contributions at each dataset size. For instance, with the IMDB dataset and logistic regression, the expectations can be predicted accurately for dataset sizes ranging from k = 100 to k = 1000. To evaluate this systematically, the accuracy of the scaling-law predictions is measured across dataset sizes for both versions of the likelihood-based estimator, using different numbers of samples. A more detailed analysis shows that the R² score drops when predictions are extrapolated beyond k = 2500, while the correlation and rank correlation with the true expectations remain high.
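    These evaluation metrics can be computed as in the sketch below (the arrays are illustrative placeholders, not the paper’s numbers):

```python
# Sketch of the evaluation: compare scaling-law predictions against "true"
# expected contributions at a held-out dataset size using R^2, correlation,
# and rank correlation. The arrays are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import r2_score

true_psi = np.array([2.4e-4, -1.1e-4, 5.0e-4, 8.0e-5, 3.1e-4])
pred_psi = np.array([2.1e-4, -0.9e-4, 4.6e-4, 1.0e-4, 2.8e-4])

print(f"R^2:      {r2_score(true_psi, pred_psi):.3f}")
print(f"pearson:  {pearsonr(true_psi, pred_psi)[0]:.3f}")
print(f"spearman: {spearmanr(true_psi, pred_psi)[0]:.3f}")
```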

    In conclusion, the Stanford researchers examined how the value of individual data points changes with scale and found evidence for a simple pattern that holds across datasets and model types. Experiments confirmed this scaling law, showing a clear log-linear trend and testing how well it predicts contributions at different dataset sizes. The law can be used to predict behavior at dataset sizes larger than those measured. However, measuring this behavior for an entire training dataset is expensive, so the researchers developed estimators that recover the scaling parameters from a small number of noisy observations per data point.

    The findings also underscore the central role of high-quality data in AI research.

    Check out the Paper. All credit for this research goes to the researchers of this project.
