    Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks

    May 15, 2025

    The quality of the data used to pretrain LLMs has become increasingly critical to their success. To build information-rich corpora, researchers have moved from heuristic filtering methods, such as rule-based noise removal and deduplication, to model-driven filtering, which uses neural classifiers to identify high-quality samples. Despite its benefits, this approach still faces two key issues: it lacks an efficient validation mechanism for assessing data quality promptly, and it often relies on manually curated seed datasets that introduce subjectivity. While early datasets like C4 and the Pile laid the groundwork for model development, recent efforts like RefinedWeb, Dolma, and DCLM have scaled significantly, incorporating up to trillions of tokens. Model-driven filtering has gained traction in these newer corpora for its ability to refine massive datasets and enhance LLM performance across downstream tasks.

    Nevertheless, the effectiveness of model-driven filtering is limited by the high costs and inefficiencies of current validation methods and the absence of clear standards for seed data selection. Recent datasets, such as FineWeb-edu and Ultra-FineWeb, have demonstrated improved model performance by using multiple classifiers to cross-verify data quality. These datasets outperform previous versions on benchmarks like MMLU, ARC, and C-Eval, indicating that refined filtering methods can enhance English and Chinese understanding. To further optimize this process, some studies propose using LLMs for multi-dimensional data evaluation via prompts or leveraging token-level perplexity scores. These innovations aim to lower computational overhead while improving data quality, ultimately enabling more effective training with fewer tokens. 
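For readers unfamiliar with the perplexity scores mentioned above, the following minimal sketch shows how a perplexity value is derived from per-token log-probabilities. It is a generic illustration of the standard definition, not code from any of the cited studies; the function name and the toy input are assumptions.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy input: a model that assigns probability 0.5 to each of four tokens.
uniform = perplexity([math.log(0.5)] * 4)
```

Lower perplexity means the scoring model finds the text more predictable; how that signal maps to "quality" depends on the scoring model and the filtering threshold chosen.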

    Researchers from ModelBest Inc., Tsinghua University, and Soochow University developed an efficient data filtering pipeline to improve LLM training. They introduced a verification strategy that uses a nearly trained LLM to evaluate new data: the candidate data is added during the final training steps, and the resulting performance gains are observed, which reduces computational costs compared with training a model from scratch for each check. A lightweight fastText-based classifier further improves filtering speed and accuracy. Applied to the FineWeb and Chinese FineWeb datasets, this method produced the Ultra-FineWeb dataset, containing about 1 trillion English tokens and 120 billion Chinese tokens. LLMs trained on Ultra-FineWeb showed notable performance gains, confirming the pipeline’s effectiveness in improving data quality and training efficiency.
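The verification idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes you already have benchmark scores for a nearly trained checkpoint finished on its default data (the baseline) and for the same checkpoint finished with each candidate data pool mixed in. The function names, the `min_gain` threshold, and every number below are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def average_gain(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Mean score difference (candidate - baseline) over the shared benchmarks."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no benchmarks in common")
    return mean(candidate[name] - baseline[name] for name in shared)


def select_pools(baseline, runs, min_gain=0.5):
    """Keep candidate pools whose average gain meets the (hypothetical) threshold."""
    gains = {pool: average_gain(baseline, scores) for pool, scores in runs.items()}
    return {pool: gain for pool, gain in gains.items() if gain >= min_gain}


# Made-up scores, for illustration only.
baseline = {"mmlu": 40.0, "arc": 50.0}
runs = {
    "pool_a": {"mmlu": 41.0, "arc": 52.0},  # improves on both benchmarks
    "pool_b": {"mmlu": 39.5, "arc": 50.0},  # slightly worse on one benchmark
}
selected = select_pools(baseline, runs)
```

In this toy run only `pool_a` clears the threshold, so it would be the pool considered as a source of seed samples.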

    The study outlines an efficient, high-quality data filtering pipeline to reduce computational costs while maintaining data integrity. It begins by using a cost-effective verification strategy to select reliable seed samples from a candidate pool, which are then used to train a data classifier. Positive seeds are sourced from LLM annotations, curated datasets, textbooks, and synthesized content, while negatives come from diverse corpora. Classifier training avoids over-iteration, focusing instead on high-quality seed selection. A fastText-based classifier is used for scalable filtering, offering competitive performance at significantly lower inference costs compared to LLM-based methods, with preprocessing steps ensuring balanced, clean data input. 
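As a rough sketch of the preprocessing described above, the snippet below turns positive and negative seed texts into class-balanced training lines in fastText's supervised input format, where each line starts with a `__label__` prefix. The label names, the balancing-by-downsampling choice, and the sample texts are assumptions for illustration; the article does not specify the authors' exact preprocessing.

```python
import random
import re

# Hypothetical label names; "__label__" is fastText's default label prefix.
POS_LABEL = "__label__hq"
NEG_LABEL = "__label__lq"


def normalize(text: str) -> str:
    """Collapse all whitespace (including newlines) so each sample fits on one line."""
    return re.sub(r"\s+", " ", text).strip()


def build_training_lines(positives, negatives, seed=0):
    """Return shuffled, class-balanced lines in fastText supervised format."""
    rng = random.Random(seed)
    pos = [t for t in map(normalize, positives) if t]  # drop empty samples
    neg = [t for t in map(normalize, negatives) if t]
    n = min(len(pos), len(neg))
    # Downsample the larger class so both labels are equally represented.
    lines = [f"{POS_LABEL} {t}" for t in rng.sample(pos, n)]
    lines += [f"{NEG_LABEL} {t}" for t in rng.sample(neg, n)]
    rng.shuffle(lines)
    return lines


lines = build_training_lines(
    positives=["A proof that  sqrt(2)\nis irrational.", "How photosynthesis works."],
    negatives=["click here!!!", "buy now", "   ", "lorem ipsum"],
)
```

Written to a text file, lines like these could be passed to the fastText Python bindings via `fasttext.train_supervised(input=path)`; the hyperparameters to use there are not given in the article.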

    The models were trained using Megatron-LM with the MiniCPM-1.2B architecture on 100 billion tokens. Evaluations used Lighteval across English and Chinese benchmarks. The results show that models trained on Ultra-FineWeb consistently outperformed those trained on FineWeb and FineWeb-edu, both individually and in mixed-language settings. Ultra-FineWeb-en achieved the highest English average score, while Ultra-FineWeb-zh improved performance on Chinese tasks. Ablation studies indicated that Ultra-FineWeb maintains balanced token lengths and benefits from efficient filtering strategies.

    In conclusion, the study presents Ultra-FineWeb, a high-quality dataset comprising about 1 trillion English tokens and 120 billion Chinese tokens. Built upon FineWeb and Chinese FineWeb, it relies on an efficient data filtering pipeline that combines a lightweight fastText-based classifier with a low-cost verification strategy. The pipeline improves filtering accuracy, reduces reliance on manual seed data selection, and keeps computational overhead low. Experimental results show that models trained on Ultra-FineWeb consistently outperform those trained on earlier datasets across benchmarks. The methodology supports reproducibility and offers useful insights for optimizing data quality in future LLM training.

    Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

    The post Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks appeared first on MarkTechPost.
