EPFL Researchers Release 4M: An Open-Source Training Framework to Advance Multimodal AI

    January 7, 2025

Multimodal foundation models are becoming increasingly relevant in artificial intelligence, enabling systems to process and integrate multiple forms of data, such as images, text, and audio, to address diverse tasks. However, these systems face significant challenges. Because they are trained on a limited range of datasets and modalities, existing models often struggle to generalize across a wide variety of tasks. Additionally, many current architectures suffer from negative transfer, where performance on certain tasks deteriorates as new modalities are added. These challenges hinder scalability and consistency, underscoring the need for frameworks that can unify diverse data representations while preserving task performance.

    Researchers at EPFL have introduced 4M, an open-source framework designed to train versatile and scalable multimodal foundation models that extend beyond language. 4M addresses the limitations of existing approaches by enabling predictions across diverse modalities, integrating data from sources such as images, text, semantic features, and geometric metadata. Unlike traditional frameworks that cater to a narrow set of tasks, 4M expands to support 21 modalities, three times more than many of its predecessors.

    A core innovation of 4M is its use of discrete tokenization, which converts diverse modalities into a unified sequence of tokens. This unified representation allows the model to leverage a Transformer-based architecture for joint training across multiple data types. By simplifying the training process and removing the need for task-specific components, 4M achieves a balance between scalability and efficiency. As an open-source project, it is accessible to the broader research community, fostering collaboration and further development.
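
To make the idea concrete, here is a minimal sketch of unified discrete tokenization. The modality names, vocabulary sizes, and tokenizer outputs below are illustrative assumptions, not 4M's actual configuration:

```python
# Sketch: per-modality token ids shifted into one shared vocabulary so a
# single Transformer can consume every modality as a flat token sequence.
import numpy as np

VOCAB_SIZES = {"text": 30_000, "image": 8_192, "depth": 8_192}  # assumed sizes

# Compute each modality's offset into the shared vocabulary.
OFFSETS, total = {}, 0
for name, size in VOCAB_SIZES.items():
    OFFSETS[name] = total
    total += size

def to_shared_ids(modality: str, local_ids: np.ndarray) -> np.ndarray:
    """Shift modality-local token ids into the shared vocabulary."""
    return local_ids + OFFSETS[modality]

# Hypothetical tokenizer outputs: text ids from a WordPiece-style tokenizer,
# image ids from a discrete VAE codebook (stand-ins for real tokenizers).
text_ids = np.array([101, 2054, 2003, 102])        # local text ids
image_ids = np.random.randint(0, 8_192, size=196)  # 14x14 grid of codebook ids

sequence = np.concatenate([
    to_shared_ids("text", text_ids),
    to_shared_ids("image", image_ids),
])
print(sequence.shape, sequence.min(), sequence.max())  # one flat sequence
```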

    Technical Details and Advantages

The 4M framework uses an encoder-decoder Transformer architecture tailored for multimodal masked modeling. During training, each modality is tokenized with an encoder suited to its data type: images are mapped to discrete tokens by spatial discrete VAEs, while text and structured metadata are processed with a WordPiece tokenizer. This consistent approach to tokenization allows diverse data types to be integrated seamlessly.
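
As a rough illustration of the masked-modeling objective, the sketch below samples disjoint random subsets of a flat multimodal token sequence as encoder inputs and decoder targets. The sampling scheme and subset sizes are assumptions for illustration, not the framework's published training recipe:

```python
# Sketch: one multimodal masked-modeling step. A random subset of tokens is
# shown to the encoder; the decoder must predict a disjoint random subset of
# the remaining tokens, regardless of which modality they came from.
import numpy as np

rng = np.random.default_rng(0)

def sample_input_target(tokens: np.ndarray, n_in: int, n_tgt: int):
    """Split a flat token sequence into encoder inputs and decoder targets
    by drawing disjoint random index sets."""
    perm = rng.permutation(len(tokens))
    in_idx, tgt_idx = perm[:n_in], perm[n_in:n_in + n_tgt]
    return (tokens[in_idx], in_idx), (tokens[tgt_idx], tgt_idx)

tokens = rng.integers(0, 46_384, size=300)  # ids in the shared vocabulary
(inp, in_pos), (tgt, tgt_pos) = sample_input_target(tokens, n_in=128, n_tgt=128)
# A Transformer encoder would embed (inp, in_pos); the decoder would be
# trained with cross-entropy to predict tgt at positions tgt_pos.
print(inp.shape, tgt.shape)
```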

    One notable feature of 4M is its capability for fine-grained and controllable data generation. By conditioning outputs on specific modalities, such as human poses or metadata, the model provides a high degree of control over the generated content. Additionally, 4M’s cross-modal retrieval capabilities allow for queries in one modality (e.g., text) to retrieve relevant information in another (e.g., images).
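
Cross-modal retrieval of this kind is typically implemented by comparing embeddings in a shared space. The toy example below uses cosine similarity over random vectors as stand-ins for real text and image embeddings; it shows the retrieval pattern, not 4M's own embedding spaces:

```python
# Sketch: embedding-based cross-modal retrieval via cosine similarity.
import numpy as np

def cosine_retrieve(query_vec: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return indices and scores of the k gallery embeddings most similar
    to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

rng = np.random.default_rng(1)
text_embedding = rng.normal(size=256)             # hypothetical text query
image_embeddings = rng.normal(size=(1_000, 256))  # hypothetical image gallery

top_idx, top_scores = cosine_retrieve(text_embedding, image_embeddings)
print(top_idx, np.round(top_scores, 3))
```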

The framework’s scalability is another strength. Trained on large datasets such as COYO-700M and CC12M, 4M draws on more than 500 million samples and scales up to three billion parameters. By compressing dense data into sparse token sequences, it reduces memory and compute costs, making it a practical choice for complex multimodal tasks.
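
A back-of-the-envelope calculation shows why token compression matters. The patch grid, codebook width, and id precision below are assumed, illustrative values rather than 4M's exact tokenizer configuration:

```python
# Sketch: dense RGB image versus a grid of discrete codebook ids.
dense_bytes = 224 * 224 * 3  # raw uint8 pixels
tokens = 14 * 14             # e.g. one codebook id per 16x16 patch
token_bytes = tokens * 2     # ids fit in uint16 for an 8k codebook
print(dense_bytes, token_bytes, dense_bytes / token_bytes)  # 384x smaller
```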

    Results and Insights

4M’s capabilities are evident across a range of tasks. In evaluations, it handled 21 modalities without compromising results relative to specialized models. For instance, 4M’s XL model achieved a semantic segmentation mIoU score of 48.1, matching or exceeding benchmarks while handling three times as many tasks as earlier models.
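
For reference, mIoU is the mean over classes of the intersection-over-union between predicted and ground-truth segmentation masks. The snippet below computes the standard definition from a confusion matrix, using synthetic labels rather than 4M outputs:

```python
# Sketch: mean intersection-over-union (mIoU) for semantic segmentation.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)  # conf[gt, pred] counts
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    valid = union > 0                               # skip absent classes
    return float((inter[valid] / union[valid]).mean())

rng = np.random.default_rng(2)
gt = rng.integers(0, 5, size=(64, 64))
pred = np.where(rng.random((64, 64)) < 0.8, gt, rng.integers(0, 5, (64, 64)))
print(round(mean_iou(pred, gt, num_classes=5), 3))
```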


    The framework also excels in transfer learning. Tests on downstream tasks, such as 3D object detection and multimodal semantic segmentation, show that 4M’s pretrained encoders maintain high accuracy across both familiar and novel tasks. These results highlight its potential for applications in areas like autonomous systems and healthcare, where integrating multimodal data is critical.
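
The usual pattern for exploiting such pretrained encoders is to freeze them and train only a small task-specific head. The PyTorch sketch below uses a stand-in encoder module, not 4M's released checkpoints:

```python
# Sketch: transfer learning with a frozen pretrained encoder and a new head.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(  # stand-in for a pretrained trunk
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():    # freeze the pretrained weights
    p.requires_grad = False

head = nn.Linear(256, 10)         # new downstream task head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

tokens = torch.randn(4, 50, 256)  # embedded token sequences (batch of 4)
labels = torch.randint(0, 10, (4,))

features = encoder(tokens).mean(dim=1)  # mean-pooled sequence features
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()                   # gradients flow only into the head
optimizer.step()
print(float(loss))
```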

    Conclusion

    The 4M framework marks a significant step forward in the development of multimodal foundation models. By tackling scalability and cross-modal integration challenges, EPFL’s contribution sets the stage for more flexible and efficient AI systems. Its open-source release encourages the research community to build on this work, pushing the boundaries of what multimodal AI can achieve. As the field evolves, frameworks like 4M will play a crucial role in enabling new applications and advancing the capabilities of AI.


Check out the Paper, Project Page, GitHub Page, Demo, and Blog. All credit for this research goes to the researchers of this project.
