    Model Performance Begins with Data: Researchers from Ai2 Release DataDecide—A Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints

    April 17, 2025

    The Challenge of Data Selection in LLM Pretraining

    Developing large language models entails substantial computational investment, especially when experimenting with alternative pretraining corpora. Comparing datasets at full scale (on the order of billions of parameters and hundreds of billions of tokens) can consume hundreds of thousands of GPU hours per run. Consequently, practitioners resort to smaller-scale experiments as proxies for large-model behavior. Yet these “pilot” studies are rarely published, producing a fragmented landscape in which each laboratory repeats similar small-scale tests without shared benchmarks or methodologies. This opacity impedes reproducibility, underutilizes collective insights, and obscures the true trade-offs between development compute and final model performance.

    DataDecide

    To address these limitations, the Allen Institute for AI (Ai2), in collaboration with the University of Washington and the University of Pennsylvania, has released DataDecide, a comprehensive suite of controlled pretraining experiments spanning 25 distinct corpora and 14 model sizes from 4 million to 1 billion parameters. DataDecide’s datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, alongside variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed ratio of 100 tokens per parameter, reflecting the “overtraining” regime that favors inference efficiency. In total, over 1,050 models and more than 30,000 checkpoints, each evaluated across ten downstream tasks, are released to the public.
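    The size of the suite follows from simple arithmetic. The sketch below is illustrative only: it assumes three seeds per (corpus, model size) pair, which is the figure that makes 25 × 14 come out to 1,050, and it uses nominal parameter counts for a few sizes rather than the actual 14-size ladder. The function name `token_budget` is hypothetical.

```python
# Illustrative arithmetic only; the seed count and sample sizes are assumptions.
NUM_CORPORA = 25          # data recipes (from the article)
NUM_MODEL_SIZES = 14      # 4M to 1B parameters (from the article)
SEEDS_PER_CONFIG = 3      # assumption: 25 * 14 * 3 matches the reported 1,050
TOKENS_PER_PARAM = 100    # fixed token-to-parameter ratio (from the article)


def token_budget(num_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Training tokens implied by a fixed token-to-parameter ratio."""
    return num_params * ratio


total_models = NUM_CORPORA * NUM_MODEL_SIZES * SEEDS_PER_CONFIG

# Nominal endpoints of the ladder plus one intermediate size.
sample_sizes = {"4M": 4_000_000, "150M": 150_000_000, "1B": 1_000_000_000}
budgets = {name: token_budget(n) for name, n in sample_sizes.items()}

print(total_models)
for name, tokens in budgets.items():
    print(f"{name}: {tokens:,} tokens")
```

    Under these nominal sizes, the smallest model sees 400 million tokens and the largest sees 100 billion, which shows why decisions made at the small end are so much cheaper.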

    Technical Structure and Pragmatic Benefits

    DataDecide orchestrates experiments along three axes:

      • Data Recipes: Twenty-five well-documented pretraining corpora, each embodying different curation strategies (see Table 1 in the paper for full recipe specifications).
      • Model Scale: Fourteen parameter configurations (4 M–1 B), programmatically derived via the OLMo model ladder to ensure consistent training hyperparameters across scales. Each non‑target scale includes two “early‑stop” seed runs, while the 1 B‑parameter models feature three complete seed reruns to quantify variability.
      • Evaluation Suite: The OLMES benchmark of ten multiple-choice tasks (e.g., MMLU, ARC Easy/Challenge, HellaSwag, SocialIQA) provides a multifaceted view of language understanding and commonsense reasoning. Code generation is assessed with MBPP and HumanEval, which are generative benchmarks rather than multiple-choice tasks.
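      Multiple-choice benchmarks are commonly scored by comparing the likelihood a model assigns to each candidate answer. The following is a generic sketch of that idea, not the OLMES or DataDecide implementation: the function name, the input format, and the per-character normalization are assumptions, and the log-probabilities are made up. It returns both a discrete right/wrong outcome and a continuous per-character-normalized probability of the correct answer.

```python
import math
from typing import Sequence, Tuple


def score_item(
    logprobs: Sequence[float], num_chars: Sequence[int], correct: int
) -> Tuple[bool, float]:
    """Score one multiple-choice item.

    logprobs[i]  : summed log-probability of answer choice i (hypothetical input)
    num_chars[i] : character length of answer choice i
    correct      : index of the gold answer
    """
    # Normalize by length so longer answers are not penalized for being long.
    per_char = [lp / n for lp, n in zip(logprobs, num_chars)]
    predicted = max(range(len(per_char)), key=per_char.__getitem__)
    correct_prob = math.exp(per_char[correct])
    return predicted == correct, correct_prob


# Two hypothetical models on the same item; both pick the wrong choice (index 1).
weak = score_item([-12.0, -9.0, -15.0], [10, 10, 10], correct=0)
better = score_item([-9.5, -9.0, -15.0], [10, 10, 10], correct=0)

print(weak, better)
```

      Both models score zero on accuracy here, yet the continuous score still separates them. That is the kind of signal a discrete metric discards when models are small.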

      By releasing both pretraining datasets and corresponding models, DataDecide enables researchers to:

      • Reuse checkpoints for new evaluations without retraining.
      • Experiment with novel prediction methods (e.g., advanced scaling‑law fits, smoothing techniques).
      • Investigate benchmark sensitivity to training data and model scale.
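      As one example of a prediction method that could be tried on released checkpoints, a deliberately simplified power-law extrapolation can be fit in log-log space. This is not one of the scaling-law baselines from the paper; the functional form, the synthetic data points, and the function name are assumptions made for illustration.

```python
import math


def fit_power_law(compute, error):
    """Least-squares fit of error = a * compute**b in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back to linear space
    return a, b


# Synthetic points generated from error = 2.0 * C**-0.1 (not real results).
small_compute = [1e15, 1e16, 1e17]
small_error = [2.0 * c ** -0.1 for c in small_compute]

a, b = fit_power_law(small_compute, small_error)

# Extrapolate the fitted curve to a larger, hypothetical target compute budget.
predicted_error = a * 1e19 ** b
print(a, b, predicted_error)
```

      On real checkpoints the points would be noisy and the curve need not be a clean power law, so any extrapolation like this should be compared against simply ranking corpora at a single small scale.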

      Key Findings and Quantitative Insights

      DataDecide’s systematic analysis yields four practical guidelines:

        • Single‑Scale Baseline Robustness: Ranking corpora by downstream accuracy at a single, small scale (e.g., 150 M parameters) achieves ~80 percent decision accuracy for predicting the best dataset at the 1 B‑parameter target scale. In contrast, eight baseline scaling‑law extrapolations do not surpass this simple heuristic, underscoring its cost‑effectiveness.
        • Task-Dependent Compute Sensitivity: The compute budget required for reliable decisions varies markedly by task. Benchmarks like MMLU and ARC Easy become predictable with less than 0.01 percent of the target compute, whereas HellaSwag and SocialIQA demand orders of magnitude more FLOPs to achieve similar decision accuracy.
        • Proxy Metric Selection: Continuous likelihood metrics, specifically the character-normalized average probability of correct continuations (CORRECT PROB) and total probability (TOTAL PROB), outperform discrete accuracy measures at small scales. This is most pronounced on code tasks (MBPP, HumanEval), where decision accuracy jumps from near-random to over 80 percent with CORRECT PROB as the proxy.
        • Variance and Spread Considerations: High decision accuracy correlates with low run‑to‑run variance (noise) and ample performance spread across datasets. Proxy metrics that reduce noise or amplify spread thus directly enhance prediction reliability.
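        The article does not spell out how decision accuracy is computed. A minimal sketch, assuming a pairwise definition (the fraction of dataset pairs whose ordering at the small scale matches their ordering at the target scale), could look like this. The recipe names and scores are hypothetical.

```python
from itertools import combinations
from typing import Dict


def decision_accuracy(
    small_scores: Dict[str, float], target_scores: Dict[str, float]
) -> float:
    """Fraction of dataset pairs ranked the same way at both scales."""
    names = sorted(small_scores)
    pairs = list(combinations(names, 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (target_scores[a] > target_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical benchmark scores for five data recipes at two model scales.
small = {"recipe_a": 0.31, "recipe_b": 0.29, "recipe_c": 0.35,
         "recipe_d": 0.33, "recipe_e": 0.30}
target = {"recipe_a": 0.52, "recipe_b": 0.47, "recipe_c": 0.58,
          "recipe_d": 0.50, "recipe_e": 0.49}

acc = decision_accuracy(small, target)
best_small = max(small, key=small.get)
best_target = max(target, key=target.get)
print(acc, best_small, best_target)
```

        In this toy example only the (recipe_a, recipe_d) pair flips between scales, so nine of ten pairwise decisions carry over. Noisy scores or a narrow spread between recipes would flip more pairs and lower the figure.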

        Concluding Perspective

        DataDecide transforms pretraining data selection from an ad hoc art into a transparent, data-driven science. By open-sourcing all 25 corpora, 1,050 models, 30,000+ checkpoints, and evaluation scripts on Hugging Face and GitHub, Ai2 invites the community to reproduce findings, extend evaluations to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute resources, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way toward more efficient, reproducible, and collaborative AI research.


        Check out the Paper, Model on Hugging Face and Technical details.

          The post Model Performance Begins with Data: Researchers from Ai2 Release DataDecide—A Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints appeared first on MarkTechPost.
