
    MINT-1T Dataset Released: A Multimodal Dataset with One Trillion Tokens to Build Large Multimodal Models

    July 26, 2024

    Artificial intelligence research, particularly the training of large multimodal models (LMMs), relies heavily on vast datasets that interleave sequences of images and text. These datasets enable the development of sophisticated models capable of understanding and generating multimodal content. As model capabilities advance, the need for extensive, high-quality datasets becomes even more critical, driving researchers to explore new methods of data collection and curation.

    A significant challenge in AI research is the scarcity of large-scale, open-source, multimodal interleaved datasets. These datasets are essential for training models that seamlessly integrate text and image data. Their limited availability hampers the development of robust, high-performing open-source models, resulting in a performance gap between open-source and proprietary models. Closing this gap requires new approaches to dataset creation that can provide the necessary scale and diversity.

    Existing methods for creating multimodal datasets typically collect and curate data from HTML documents. Notable datasets such as OBELICS have been instrumental but are limited in scale and diversity because they source data almost exclusively from HTML. This restriction constrains the variety and richness of the data, which in turn limits the performance and applicability of the resulting models. Researchers have found that datasets sourced solely from HTML documents fail to capture the full spectrum of multimodal content required for comprehensive model training.

    Researchers from the University of Washington, Salesforce Research, Stanford University, the University of Texas at Austin, and the University of California, Berkeley introduced MINT-1T, the most extensive and diverse open-source multimodal interleaved dataset to date, addressing the need for larger and more varied datasets. MINT-1T comprises one trillion text tokens and 3.4 billion images drawn from HTML documents, PDFs, and ArXiv papers, a roughly tenfold increase over previous datasets that significantly expands the data available for training multimodal models. The collaboration across these institutions reflects a concerted effort to close the gap in dataset availability.
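    For readers who want to work with the released data, the sketch below shows one way to stream an interleaved subset with the Hugging Face datasets library. The repository name and record schema are assumptions for illustration; consult the project's GitHub page for the actual download locations and format.

```python
# Minimal sketch: stream a slice of an interleaved multimodal corpus.
# The repository name below is an assumption, not a confirmed location;
# check the MINT-1T GitHub page for the real dataset repositories.
from datasets import load_dataset

# Streaming avoids materializing a trillion-token corpus on disk.
ds = load_dataset(
    "mlfoundations/MINT-1T-HTML",  # assumed repo name
    split="train",
    streaming=True,
)

for doc in ds.take(3):
    # Each record is expected to interleave text segments with image
    # references; the exact keys may differ from what is shown here.
    print({key: type(value).__name__ for key, value in doc.items()})
```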

    Creating the MINT-1T dataset involved an intricate process of sourcing, filtering, and deduplicating data. The HTML collection was expanded to include documents from earlier years, PDFs were processed to extract readable text and images, and ArXiv papers were parsed for figures and text, ensuring a comprehensive collection of multimodal content. Advanced filtering removed low-quality, non-English, and inappropriate content, and deduplication eliminated repetitive data, preserving the dataset's quality and diversity.
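    To make the notion of an "interleaved" document concrete, the sketch below models one as an ordered sequence of text and image elements, which is what the HTML, PDF, and ArXiv pipelines ultimately produce. The field names are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative data model (not the paper's actual schema): an interleaved
# multimodal document is an ordered sequence of text and image elements.
from dataclasses import dataclass, field
from typing import List, Literal, Optional

@dataclass
class Element:
    kind: Literal["text", "image"]
    text: Optional[str] = None       # set for text elements
    image_ref: Optional[str] = None  # URL, path, or archive key for images

@dataclass
class InterleavedDoc:
    source: Literal["html", "pdf", "arxiv"]
    elements: List[Element] = field(default_factory=list)

# Example: a PDF-derived document whose caption precedes its figure.
doc = InterleavedDoc(
    source="pdf",
    elements=[
        Element(kind="text", text="Figure 1 shows the training curve."),
        Element(kind="image", image_ref="figures/fig1.png"),
    ],
)
```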

    Experiments demonstrated that LMMs trained on the MINT-1T dataset matched, and often surpassed, the performance of models trained on previous leading datasets such as OBELICS. Including more diverse sources in MINT-1T resulted in better generalization and performance across various benchmarks. Notably, the dataset significantly improved performance on tasks involving visual question answering and multimodal reasoning, and models trained on MINT-1T performed better across varying numbers of in-context demonstrations, highlighting the dataset's effectiveness.

    The MINT-1T dataset’s construction included detailed steps to ensure data quality and diversity. The dataset comprises 922 billion HTML tokens, 106 billion PDF tokens, and 9 billion ArXiv tokens. Filtering eliminated documents with inappropriate content and non-English text, using tools such as fastText for language identification and NSFW detectors for image content. Deduplication was equally crucial: Bloom filters removed duplicate paragraphs and documents, and hashing techniques eliminated repetitive images.
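    The sketch below illustrates how such checks are commonly wired together: fastText for language identification, a Bloom filter for paragraph-level deduplication, and perceptual hashing for near-duplicate images. The library choices (fasttext, pybloom-live, imagehash), the model file, and the thresholds are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of the filtering/dedup checks described above. Library choices,
# thresholds, and the language-ID model path are illustrative assumptions.
import fasttext                       # language identification
import imagehash                      # perceptual image hashing
from PIL import Image
from pybloom_live import BloomFilter  # probabilistic set for paragraph dedup

lang_model = fasttext.load_model("lid.176.bin")   # pretrained fastText LID model
seen_paragraphs = BloomFilter(capacity=10_000_000, error_rate=0.001)
seen_image_hashes = set()

def keep_paragraph(paragraph: str) -> bool:
    """Keep English paragraphs that have not been seen before."""
    labels, probs = lang_model.predict(paragraph.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < 0.8:  # assumed threshold
        return False
    if paragraph in seen_paragraphs:                  # probable duplicate
        return False
    seen_paragraphs.add(paragraph)
    return True

def keep_image(path: str) -> bool:
    """Drop images whose perceptual hash has already been seen."""
    h = str(imagehash.phash(Image.open(path)))
    if h in seen_image_hashes:
        return False
    seen_image_hashes.add(h)
    return True
```

    A Bloom filter keeps memory bounded at trillion-token scale at the cost of a small, controllable false-positive rate, which here means occasionally discarding a paragraph that was not actually a duplicate.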

    In conclusion, MINT-1T addresses both the scarcity and the limited diversity of open multimodal datasets. By introducing a larger and more varied dataset, the researchers have enabled the development of more robust, higher-performing open-source multimodal models. This work highlights the importance of data diversity and scale in AI research and paves the way for future improvements and applications in multimodal AI. The dataset’s extensive scale, with one trillion text tokens and 3.4 billion images, provides a solid foundation for advancing AI capabilities.

    Check out the Paper, Details, and GitHub. All credit for this research goes to the researchers of this project.


    The post MINT-1T Dataset Released: A Multimodal Dataset with One Trillion Tokens to Build Large Multimodal Models appeared first on MarkTechPost.
