
    HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining

    June 3, 2024

    Hugging Face has introduced FineWeb, a comprehensive dataset designed to enhance the training of large language models (LLMs). Released on May 31, 2024, the dataset sets a new standard for open pretraining corpora, promising improved performance through meticulous data curation and innovative filtering techniques.

    FineWeb draws from 96 CommonCrawl snapshots, encompassing a staggering 15 trillion tokens and occupying 44TB of disk space. CommonCrawl, a non-profit organization that has been archiving the web since 2007, provided the raw material for this dataset. Hugging Face leveraged these extensive web crawls to compile a rich and diverse dataset, aiming to surpass the capabilities of previous datasets like RefinedWeb and C4.
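
    The full corpus is published on the Hugging Face Hub and can be streamed rather than downloaded in full. Below is a minimal sketch using the `datasets` library; the repo id, the "sample-10BT" subset name, and the record field names reflect the Hub listing at release time, so treat all of them as assumptions to verify against the dataset card.

```python
# Stream a small FineWeb sample without downloading the 44TB corpus.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",  # assumed Hub repo id
    name="sample-10BT",       # assumed ~10B-token sample config
    split="train",
    streaming=True,           # iterate lazily over remote shards
)

for doc in fw.take(3):
    # Field names ("text", "url", "dump") are assumed from the dataset card.
    print(doc["url"], doc["dump"])
    print(doc["text"][:200])
```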

    One of the standout features of FineWeb is its rigorous deduplication process. Using MinHash, a fuzzy hashing technique, the Hugging Face team effectively eliminated redundant data, which improves model performance by reducing memorization of duplicated content and making training more efficient. The team compared deduplicating each CommonCrawl snapshot individually against deduplicating globally across all snapshots, and found that per-snapshot deduplication better retained high-quality data.
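
    As an illustration of the technique (not FineWeb's actual large-scale pipeline), here is a minimal MinHash deduplication sketch using the `datasketch` library; the shingle size and similarity threshold are arbitrary choices for the example.

```python
# Near-duplicate detection with MinHash + LSH (illustrative parameters).
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over 5-word shingles."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - 4)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the old river bank",
    "b": "the quick brown fox jumps over the lazy dog near the old river bend",
    "c": "an entirely different document about pretraining large language models",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard cutoff
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):      # near-duplicate of an already-kept document
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # "b" is likely dropped as a near-duplicate of "a"
```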

    Quality is a cornerstone of FineWeb. The dataset employs advanced filtering strategies to remove low-quality content. Initial steps involved language classification and URL filtering to exclude non-English text and adult content. Building on the foundation laid by C4, additional heuristic filters were applied, such as removing documents dominated by boilerplate or dropping lines that do not end in terminal punctuation.
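
    To make the flavor of these heuristics concrete, here is a toy sketch of two C4-style rules; the thresholds and boilerplate markers are invented for the example and are not FineWeb's actual values.

```python
# Toy C4-style quality filters: terminal punctuation and boilerplate checks.
TERMINAL = (".", "!", "?", '"', "'")
BOILERPLATE_MARKERS = ("terms of use", "privacy policy", "cookie policy")

def filter_document(text: str, min_kept_ratio: float = 0.6) -> str | None:
    """Keep clean lines; reject the document if too few survive."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    kept = [
        l for l in lines
        if l.endswith(TERMINAL)  # drop lines without terminal punctuation
        and not any(m in l.lower() for m in BOILERPLATE_MARKERS)
    ]
    if not lines or len(kept) / len(lines) < min_kept_ratio:
        return None  # document dominated by boilerplate or fragments
    return "\n".join(kept)

print(filter_document("A full sentence.\nClick here\nAnother sentence!"))
```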

    Accompanying the primary dataset, Hugging Face introduced FineWeb-Edu, a subset tailored for educational content. This subset was built from synthetic annotations generated by Llama-3-70B-Instruct, which scored 500,000 samples on their educational value. A classifier trained on these annotations was then applied to the full dataset to filter out non-educational content. The result is a 1.3-trillion-token dataset optimized for educational benchmarks such as MMLU, ARC, and OpenBookQA.
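
    The trained classifier was released alongside the dataset, so the scoring step can be reproduced. The sketch below assumes the model id published on the Hub at release time and the score cutoff of 3 described in the release post; verify both against the model card.

```python
# Score a passage for educational value with the released classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    # Regression output: roughly 0 (junk) to 5 (textbook-like).
    score = model(**inputs).logits.squeeze(-1).item()

# FineWeb-Edu keeps documents whose predicted score clears the cutoff.
print(f"score={score:.2f}:", "keep" if score >= 3 else "drop")
```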

    FineWeb has been rigorously tested against several benchmarks, consistently outperforming other open web-scale datasets. Its performance is validated through a series of “early-signal” benchmarks run on small models, including CommonsenseQA, HellaSwag, and OpenBookQA. FineWeb-Edu, in particular, showed remarkable improvements, demonstrating the effectiveness of synthetic annotations for filtering high-quality educational content.
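
    Such early-signal runs can be approximated with EleutherAI's lm-evaluation-harness. The sketch below uses its v0.4 `simple_evaluate` API; the model id is a placeholder for whatever small checkpoint you trained on a FineWeb sample, and the task names should be checked against the harness's task registry.

```python
# Early-signal evaluation of a small checkpoint (placeholder model id).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOUR_SMALL_MODEL",  # hypothetical checkpoint
    tasks=["commonsense_qa", "hellaswag", "openbookqa"],
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```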

    Hugging Face’s release of FineWeb marks a pivotal moment in the open science community. It provides researchers and users with a powerful tool to train high-performance LLMs. The dataset, released under the permissive ODC-By 1.0 license, is accessible for further research and development. Looking ahead, Hugging Face aims to extend the principles of FineWeb to other languages, thus broadening the impact of high-quality web data across diverse linguistic contexts.

    The post HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining appeared first on MarkTechPost.
