    FineWeb-C: A Community-Built Dataset For Improving Language Models In ALL Languages

    December 25, 2024

    FineWeb2 significantly advances multilingual pretraining datasets, covering over 1,000 languages with high-quality data. The dataset comprises approximately 8 terabytes of compressed text and nearly 3 trillion words, sourced from 96 CommonCrawl snapshots taken between 2013 and 2024. Processed with the datatrove library, FineWeb2 outperforms established datasets such as CC-100, mC4, CulturaX, and HPLT across nine diverse languages. The ablation and evaluation setup is available in the project's GitHub repository.

    Hugging Face community researchers introduced FineWeb-C, a collaborative, community-driven project that builds on FineWeb2 to create high-quality educational content annotations across hundreds of languages. Community members rate the educational value of web content and flag problematic elements through the Argilla platform. A language qualifies for inclusion in the dataset once it reaches 1,000 annotations. The annotation process serves two purposes: identifying high-quality educational content and improving LLM development across all languages.
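    The inclusion rule above can be sketched in a few lines of Python. This is only an illustration of the 1,000-annotation threshold, not the project's actual tooling: the function name, the language codes, and the sample annotation log are all hypothetical.

```python
from collections import Counter

# Threshold taken from the article: 1,000 annotations per language.
INCLUSION_THRESHOLD = 1_000


def languages_ready_for_inclusion(annotation_languages, threshold=INCLUSION_THRESHOLD):
    """Return the sorted language codes that have at least `threshold` annotations.

    `annotation_languages` is an iterable with one language code per
    submitted annotation (a hypothetical input format).
    """
    counts = Counter(annotation_languages)
    return sorted(lang for lang, n in counts.items() if n >= threshold)


# Hypothetical annotation log: one entry per submitted annotation.
sample = ["dan_Latn"] * 1000 + ["swe_Latn"] * 999 + ["fra_Latn"] * 1500
print(languages_ready_for_inclusion(sample))  # ['dan_Latn', 'fra_Latn']
```

    In this made-up sample, the language with 999 annotations falls one short of the threshold and is left out.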

    So far, 318 Hugging Face community members have submitted 32,863 annotations, contributing to the development of high-quality LLMs for underrepresented languages. The project takes its approach from FineWeb-Edu, a dataset built on the original FineWeb that uses an educational quality classifier, trained on Llama3-70B-Instruct annotations, to identify and retain the most educational content. That approach has proven successful: FineWeb-Edu outperforms FineWeb on popular benchmarks while reducing the data volume needed to train effective LLMs. FineWeb-C aims to extend FineWeb-Edu's capabilities to all world languages by collecting community annotations to train language-specific educational quality classifiers.

    The project prioritizes human-generated annotations over LLM-based ones, particularly for low-resource languages where LLM performance cannot be reliably validated. This community-driven approach parallels Wikipedia’s collaborative model, emphasizing open access and democratization of AI technology. Contributors join a broader movement to break language barriers in AI development, as commercial companies typically focus on profitable languages. The dataset’s open nature enables anyone to build AI systems tailored to specific community needs while facilitating learning about effective approaches across different languages.

    For some languages, FineWeb-C includes multiple annotations per page, which allows annotator agreement to be calculated flexibly. Planned quality control measures include increasing annotation overlap in heavily annotated languages. The data contains a boolean column, ‘problematic_content_label_present’, that identifies pages carrying at least one problematic content flag; such flags often result from incorrect language detection. Users can filter content either on the presence of any problematic label or on annotator agreement, using the ‘problematic_content_label_agreement’ column. The dataset is released under the ODC-By v1.0 license and is subject to CommonCrawl's Terms of Use.
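    The two filtering options described above can be sketched in plain Python. The two column names come from the article; everything else is an assumption made for illustration, including the sample rows, the helper names, the example labels, and the reading of the agreement column as a fraction between 0 and 1.

```python
from collections import Counter


def label_agreement(labels):
    """Fraction of annotators who chose the most common label (1.0 = unanimous).

    A simple illustrative agreement measure, not necessarily the one
    the dataset itself uses.
    """
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def drop_flagged(rows):
    """Strict filter: drop every page that carries any problematic flag."""
    return [r for r in rows if not r["problematic_content_label_present"]]


def drop_agreed_problematic(rows, min_agreement=1.0):
    """Lenient filter: drop a flagged page only when annotator agreement
    on the flag reaches `min_agreement` (assumed to be a 0-1 fraction)."""
    return [
        r for r in rows
        if not (
            r["problematic_content_label_present"]
            and r["problematic_content_label_agreement"] >= min_agreement
        )
    ]


# Hypothetical rows; only the two column names are taken from the article.
rows = [
    {"id": "a", "problematic_content_label_present": False,
     "problematic_content_label_agreement": 1.0},
    {"id": "b", "problematic_content_label_present": True,
     "problematic_content_label_agreement": 1.0},
    {"id": "c", "problematic_content_label_present": True,
     "problematic_content_label_agreement": 0.5},
]

print([r["id"] for r in drop_flagged(rows)])             # ['a']
print([r["id"] for r in drop_agreed_problematic(rows)])  # ['a', 'c']
```

    The strict filter suits cases where any flag is disqualifying. The lenient one keeps pages on which annotators disagreed, which may matter for languages with few annotated pages.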

    In conclusion, FineWeb-C, the community-driven extension of FineWeb2, has gathered 32,863 annotations from 318 contributors, focusing on educational content labeling. It builds on the approach of FineWeb-Edu, whose educational quality classifier outperformed FineWeb on popular benchmarks with less training data. Unlike commercial efforts, this open initiative prioritizes human annotations over LLM-based ones, particularly for low-resource languages. The dataset's quality control measures include multiple annotations per page and problematic content filtering, and it is released under the ODC-By v1.0 license.


    All credit for this research goes to the researchers of this project.

    The post FineWeb-C: A Community-Built Dataset For Improving Language Models In ALL Languages appeared first on MarkTechPost.