
    Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit?

    May 14, 2024

In the rapidly evolving fields of Artificial Intelligence and Data Science, the volume and accessibility of training data are critical factors in determining the capabilities and potential of Large Language Models (LLMs). These models train on vast volumes of text to build and refine their language understanding.

A recent tweet from Mark Cummins asks how close we are to exhausting the global reservoir of text data required for training these models, given the exponential growth in data consumption and the demanding requirements of next-generation LLMs. To explore this question, we survey the textual sources currently available across different media and compare them to the growing needs of sophisticated AI models.
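
All of the figures that follow are quoted in tokens, the subword units a tokenizer produces from raw text. As a rough illustration of how text maps to tokens, here is a minimal sketch using OpenAI's open-source tiktoken library (an arbitrary but representative choice; the datasets below were counted with various tokenizers):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models train on trillions of tokens of text."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
# English prose typically yields about 0.75 words per token,
# so 15 trillion tokens is on the order of 11 trillion words.
```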

Web Data: The English portion of the FineWeb dataset alone, a filtered subset of the Common Crawl web data, contains an astounding 15 trillion tokens. The corpus could roughly double in size if high-quality non-English web content were added.

Code Repositories: Publicly available code, such as that compiled in the Stack v2 dataset, contributes approximately 0.78 trillion tokens. While this may look insignificant next to other sources, the total amount of code worldwide is projected to be substantial, amounting to tens of trillions of tokens.

Academic Publications and Patents: Academic publications and patents total approximately 1 trillion tokens, a sizable and stylistically distinctive subset of textual data.

Books: Digital book collections from sites like Google Books and Anna’s Archive hold over 21 trillion tokens. When every distinct book in the world is taken into account, the total token count rises to an estimated 400 trillion.

Social Media Archives: Platforms such as Weibo and Twitter host user-generated content amounting to roughly 49 trillion tokens, and Facebook alone stands out with an estimated 140 trillion. This is a significant but largely unreachable resource because of privacy and ethical constraints.

Audio Transcriptions: Transcribing publicly accessible audio from sources such as YouTube and TikTok would add around 12 trillion tokens to the training corpus.

Private Communications: Emails and stored instant messages add up to a massive amount of text, roughly 1,800 trillion tokens in total. Access to this data is tightly restricted, raising privacy and ethical questions. A short script after this list puts all of these estimates side by side.
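
To make the scale of these sources easier to compare, here is a minimal sketch that tallies the estimates quoted above and separates the plausibly usable sources from the restricted ones. The accessibility flags reflect this article's framing, not a definitive judgment:

```python
# Rough token-count estimates (in trillions) quoted in this article.
# The boolean marks whether the source is plausibly usable for training.
sources = {
    "Web (FineWeb, English only)":  (15.0,   True),
    "Public code (Stack v2)":       (0.78,   True),
    "Academic papers and patents":  (1.0,    True),
    "Digitized books":              (21.0,   True),
    "All distinct books (est.)":    (400.0,  False),  # mostly undigitized or unlicensed
    "Twitter and Weibo archives":   (49.0,   False),  # privacy / terms-of-service limits
    "Facebook":                     (140.0,  False),
    "Public audio transcriptions":  (12.0,   True),
    "Private emails and messages":  (1800.0, False),
}

usable = sum(count for count, ok in sources.values() if ok)
total = sum(count for count, _ in sources.values())

print(f"Plausibly usable text:  ~{usable:.0f} trillion tokens")
print(f"All text counted here: ~{total:.0f} trillion tokens")
# Adding high-quality non-English web text pushes the usable total
# toward the ~60 trillion token ceiling discussed below.
```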

As current LLM training datasets approach the 15 trillion token level, roughly the amount of high-quality English text available, further growth faces ethical and logistical obstacles. Drawing on additional resources such as books, audio transcriptions, and corpora in other languages could yield modest gains, perhaps raising the ceiling of readable, high-quality text to 60 trillion tokens.

However, token counts in the private data warehouses run by companies like Google and Facebook reach into the quadrillions, beyond the reach of any ethical business venture. Given the constraints of limited and morally acceptable text sources, the future course of LLM development depends on the creation of synthetic data. With private data reservoirs off-limits, data synthesis appears to be a key future direction for AI research.
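
The article does not commit to a particular synthesis method, but a common pattern is to prompt an existing model for new training examples and keep only those that pass a quality filter. A minimal sketch, in which generate_fn and quality_fn are hypothetical stand-ins for whatever model API and quality scorer are available:

```python
from typing import Callable, List

def synthesize_corpus(
    seed_prompts: List[str],
    generate_fn: Callable[[str], str],   # hypothetical: wraps any LLM text API
    quality_fn: Callable[[str], float],  # hypothetical: scores text quality in [0, 1]
    threshold: float = 0.8,
) -> List[str]:
    """Generate candidate training texts from seed prompts,
    keeping only those that pass the quality filter."""
    corpus = []
    for prompt in seed_prompts:
        candidate = generate_fn(prompt)
        if quality_fn(candidate) >= threshold:
            corpus.append(candidate)
    return corpus
```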

In conclusion, the combination of growing data needs and finite text resources creates an urgent need for new approaches to LLM training. As existing datasets approach saturation, synthetic data becomes increasingly important for overcoming the looming limits of training data. This paradigm shift highlights how the field of AI research is changing and points to a deliberate turn toward synthetic data generation to sustain progress while maintaining ethical compliance.

    The post Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit? appeared first on MarkTechPost.
