
    MINT-1T Dataset Released: A Multimodal Dataset with One Trillion Tokens to Build Large Multimodal Models

    July 26, 2024

    Artificial intelligence research, particularly the training of large multimodal models (LMMs), relies heavily on vast datasets that interleave sequences of images and text. These datasets enable the development of sophisticated models capable of understanding and generating multimodal content. As model capabilities advance, the need for extensive, high-quality datasets becomes even more critical, driving researchers to explore new methods of data collection and curation.

    A significant challenge in AI research is the scarcity of large-scale, open-source, multimodal interleaved datasets. These datasets are essential for training models that seamlessly integrate text and image data. Their limited availability hampers the development of robust, high-performing open-source models, resulting in a performance gap between open-source and proprietary models. Closing this gap requires new approaches to dataset creation that can provide the necessary scale and diversity.

    Existing methods for creating multimodal datasets typically collect and curate data from HTML documents. Notable datasets such as OBELICS have been instrumental but are limited in scale and diversity because they source data almost exclusively from HTML. This restriction constrains the variety and richness of the data, which in turn limits the performance and applicability of the resulting models. Researchers have found that datasets sourced solely from HTML documents fail to capture the full spectrum of multimodal content required for comprehensive model training.

    Researchers from the University of Washington, Salesforce Research, Stanford University, the University of Texas at Austin, and the University of California, Berkeley introduced MINT-1T, the most extensive and diverse open-source multimodal interleaved dataset to date, addressing the need for larger and more varied datasets. MINT-1T comprises one trillion text tokens and 3.4 billion images drawn from HTML documents, PDFs, and ArXiv papers, a roughly tenfold increase over previous datasets that significantly expands the data available for training multimodal models. The collaboration across these institutions reflects a concerted effort to close the gap in dataset availability.
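    For readers who want to work with the released data, the sketch below shows one way to stream an interleaved subset with the Hugging Face datasets library. The repository name and record schema are assumptions for illustration; consult the project's GitHub page for the actual download locations and format.

```python
# Minimal sketch: stream a slice of an interleaved multimodal corpus.
# The repository name below is an assumption, not a confirmed location;
# check the MINT-1T GitHub page for the real dataset repositories.
from datasets import load_dataset

# Streaming avoids materializing a trillion-token corpus on disk.
ds = load_dataset(
    "mlfoundations/MINT-1T-HTML",  # assumed repo name
    split="train",
    streaming=True,
)

for doc in ds.take(3):
    # Each record is expected to interleave text segments with image
    # references; the exact keys may differ from what is shown here.
    print({key: type(value).__name__ for key, value in doc.items()})
```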

    Creating the MINT-1T dataset involved an intricate process of sourcing, filtering, and deduplicating data. The HTML collection was expanded to include documents from earlier years, PDFs were processed to extract readable text and images, and ArXiv papers were parsed for figures and text, ensuring a comprehensive collection of multimodal content. Advanced filtering removed low-quality, non-English, and inappropriate content, and deduplication eliminated repetitive data, preserving the dataset's quality and diversity.
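    To make the notion of an "interleaved" document concrete, the sketch below models one as an ordered sequence of text and image elements, which is what the HTML, PDF, and ArXiv pipelines ultimately produce. The field names are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative data model (not the paper's actual schema): an interleaved
# multimodal document is an ordered sequence of text and image elements.
from dataclasses import dataclass, field
from typing import List, Literal, Optional

@dataclass
class Element:
    kind: Literal["text", "image"]
    text: Optional[str] = None       # set for text elements
    image_ref: Optional[str] = None  # URL, path, or archive key for images

@dataclass
class InterleavedDoc:
    source: Literal["html", "pdf", "arxiv"]
    elements: List[Element] = field(default_factory=list)

# Example: a PDF-derived document whose caption precedes its figure.
doc = InterleavedDoc(
    source="pdf",
    elements=[
        Element(kind="text", text="Figure 1 shows the training curve."),
        Element(kind="image", image_ref="figures/fig1.png"),
    ],
)
```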

    Experiments demonstrated that LMMs trained on the MINT-1T dataset matched, and often surpassed, the performance of models trained on previous leading datasets such as OBELICS. Including more diverse sources in MINT-1T resulted in better generalization and performance across various benchmarks. Notably, the dataset significantly improved performance on tasks involving visual question answering and multimodal reasoning, and models trained on MINT-1T performed better across varying numbers of in-context demonstrations, highlighting the dataset's effectiveness.

    The MINT-1T dataset’s construction included detailed steps to ensure data quality and diversity. The dataset comprises 922 billion HTML tokens, 106 billion PDF tokens, and 9 billion ArXiv tokens. Filtering eliminated documents with inappropriate content and non-English text, using tools such as fastText for language identification and NSFW detectors for image content. Deduplication was equally crucial: Bloom filters removed duplicate paragraphs and documents, and hashing techniques eliminated repetitive images.
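    The sketch below illustrates how such checks are commonly wired together: fastText for language identification, a Bloom filter for paragraph-level deduplication, and perceptual hashing for near-duplicate images. The library choices (fasttext, pybloom-live, imagehash), the model file, and the thresholds are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of the filtering/dedup checks described above. Library choices,
# thresholds, and the language-ID model path are illustrative assumptions.
import fasttext                       # language identification
import imagehash                      # perceptual image hashing
from PIL import Image
from pybloom_live import BloomFilter  # probabilistic set for paragraph dedup

lang_model = fasttext.load_model("lid.176.bin")   # pretrained fastText LID model
seen_paragraphs = BloomFilter(capacity=10_000_000, error_rate=0.001)
seen_image_hashes = set()

def keep_paragraph(paragraph: str) -> bool:
    """Keep English paragraphs that have not been seen before."""
    labels, probs = lang_model.predict(paragraph.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < 0.8:  # assumed threshold
        return False
    if paragraph in seen_paragraphs:                  # probable duplicate
        return False
    seen_paragraphs.add(paragraph)
    return True

def keep_image(path: str) -> bool:
    """Drop images whose perceptual hash has already been seen."""
    h = str(imagehash.phash(Image.open(path)))
    if h in seen_image_hashes:
        return False
    seen_image_hashes.add(h)
    return True
```

    A Bloom filter keeps memory bounded at trillion-token scale at the cost of a small, controllable false-positive rate, which here means occasionally discarding a paragraph that was not actually a duplicate.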

    In conclusion, MINT-1T addresses both the scarcity and the limited diversity of open multimodal datasets. By introducing a larger and more varied dataset, the researchers have enabled the development of more robust, higher-performing open-source multimodal models. This work highlights the importance of data diversity and scale in AI research and paves the way for future improvements and applications in multimodal AI. The dataset’s extensive scale, with one trillion text tokens and 3.4 billion images, provides a solid foundation for advancing AI capabilities.

    Check out the Paper, Details, and GitHub. All credit for this research goes to the researchers of this project.


    The post MINT-1T Dataset Released: A Multimodal Dataset with One Trillion Tokens to Build Large Multimodal Models appeared first on MarkTechPost.
