
    This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation

    May 28, 2025

    Diffusion models, known for their success in generating high-quality images, are now being explored as a foundation for handling diverse data types. These models denoise data and reconstruct original content from noisy inputs. This ability makes diffusion models promising for multimodal tasks involving discrete data, such as text, and continuous data, such as images.
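
    As a rough illustration of denoising on discrete data, the sketch below treats masking as the "noise": tokens are randomly replaced by a mask symbol, and a denoiser is scored only on the positions it has to reconstruct. This is a minimal sketch under stated assumptions, not the paper's implementation; the `[MASK]` symbol and the helper names `corrupt`, `masked_positions`, and `reconstruction_accuracy` are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol standing in for "noise"


def corrupt(tokens, mask_ratio, rng):
    """Noising step: independently replace each token with MASK
    with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]


def masked_positions(noisy):
    """Indices that a denoiser would have to reconstruct."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]


def reconstruction_accuracy(original, noisy, predict):
    """Score a denoiser only on masked positions, mirroring a loss
    that is computed only where the input was corrupted."""
    idx = masked_positions(noisy)
    if not idx:
        return 1.0  # nothing was corrupted, so nothing to reconstruct
    hits = sum(predict(noisy, i) == original[i] for i in idx)
    return hits / len(idx)


rng = random.Random(0)
tokens = "a cat sat on the mat".split()
noisy = corrupt(tokens, mask_ratio=0.5, rng=rng)
print(noisy)

# An "oracle" denoiser that already knows the answer, used only to
# exercise the scoring function; a real model would predict from context.
oracle = lambda seq, i: tokens[i]
print(reconstruction_accuracy(tokens, noisy, oracle))
```

    The same corruption-and-reconstruction recipe applies to any sequence of discrete tokens, which is part of what makes this family of models attractive for mixing text with tokenized images.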

    The central challenge in multimodal modeling is building one system that handles both understanding and generation across text and images without separate methods or architectures. Existing models often struggle to balance these tasks: most are designed for a specific task such as image generation or question answering, which limits their performance on unified tasks. Post-training techniques that could further align models across reasoning and generation are also underdeveloped, leaving a gap for fully integrated multimodal models that handle diverse challenges with a single design.

    Popular approaches like Show-o, Janus, and SEED-X combine autoregressive models for text and diffusion models for images, requiring separate loss functions and architectures. These models use distinct tokenization schemes and separate pipelines for text and image tasks, complicating training and limiting their ability to handle reasoning and generation in a unified way. Furthermore, they focus heavily on pretraining strategies, overlooking post-training methods that could help these models learn to reason across different data types.

    Researchers from Princeton University, Peking University, Tsinghua University, and ByteDance have introduced MMaDA, a unified multimodal diffusion model. The system integrates textual reasoning, visual understanding, and image generation into a single probabilistic framework. MMaDA uses a shared diffusion architecture with no modality-specific components, which simplifies training across data types. Because the design processes textual and visual data together, one model can serve both reasoning and generation tasks.
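
    One way to picture "processing textual and visual data together" is a single token sequence over one shared vocabulary. The sketch below is an assumption for illustration only: the article does not describe MMaDA's tokenization, and the vocabulary sizes, the `BOI`/`EOI` marker ids, and the helper names are all hypothetical.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 512   # hypothetical number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image id
EOI = BOI + 1                                # hypothetical end-of-image id


def to_shared_ids(text_ids, image_codes):
    """Place text ids and image codes in one id space and one sequence.

    Image codes are shifted past the text vocabulary so the two
    modalities never collide.
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes):
        raise ValueError("image code out of range")
    image_ids = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return list(text_ids) + [BOI] + image_ids + [EOI]


def split_shared_ids(seq):
    """Inverse of to_shared_ids: recover text ids and image codes."""
    start, end = seq.index(BOI), seq.index(EOI)
    text_ids = seq[:start]
    image_codes = [i - TEXT_VOCAB_SIZE for i in seq[start + 1:end]]
    return text_ids, image_codes


sequence = to_shared_ids([5, 42, 7], [0, 511, 3])
print(sequence)
print(split_shared_ids(sequence))
```

    With every input expressed in one id space, a single network and a single training objective can in principle cover both modalities, which is the simplification the shared architecture is aiming for.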

    The MMaDA system introduces a mixed long chain-of-thought (Long-CoT) finetuning strategy that aligns reasoning steps across text and image tasks. The researchers curated a diverse dataset of reasoning traces, such as problem-solving in mathematics and visual question answering, to guide the model in learning complex reasoning across modalities. They also developed UniGRPO, a reinforcement learning algorithm tailored for diffusion models, which uses policy gradients and diversified reward signals, including correctness, format adherence, and alignment with visual content. The model’s training pipeline incorporates a uniform masking strategy and structured denoising steps, ensuring stability during learning and allowing the model to reconstruct content across different tasks effectively.
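
    The reward side of this pipeline can be sketched in a few lines. The example below assumes that UniGRPO, as its name suggests, inherits the group-relative advantage used by GRPO-style methods (each sampled response is scored against the mean and standard deviation of its own group); the article does not spell this out. The reward weights and the function names `total_reward` and `group_relative_advantages` are hypothetical, and the sketch omits the diffusion-specific parts of the algorithm entirely.

```python
from statistics import mean, pstdev


def total_reward(correct, well_formatted, visual_alignment,
                 weights=(1.0, 0.5, 0.5)):
    """Combine several reward signals into one scalar.

    The weights are illustrative assumptions, not values from the paper.
    visual_alignment is assumed to be a score already scaled to [0, 1].
    """
    w_c, w_f, w_v = weights
    return (w_c * float(correct)
            + w_f * float(well_formatted)
            + w_v * visual_alignment)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for a single prompt.
group = [
    total_reward(True, True, 0.8),
    total_reward(True, False, 0.6),
    total_reward(False, True, 0.4),
    total_reward(False, False, 0.1),
]
advantages = group_relative_advantages(group)
for reward, adv in zip(group, advantages):
    print(f"reward={reward:.2f} advantage={adv:+.2f}")
```

    Responses that beat their group's average receive a positive advantage and are reinforced by the policy-gradient update, while below-average responses are discouraged; no separate value network is needed for this normalization.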

    In performance benchmarks, MMaDA demonstrated strong results across diverse tasks. It achieved a CLIP score of 32.46 for text-to-image generation and an ImageReward of 1.15, outperforming models like SDXL and Janus. In multimodal understanding, it reached a POPE score of 86.1, an MME score of 1410.7, and a Flickr30k score of 67.6, surpassing systems such as Show-o and SEED-X. For textual reasoning, MMaDA scored 73.4 on GSM8K and 36.0 on MATH500, outperforming other diffusion-based models like LLaDA-8B. These results highlight MMaDA’s capacity to deliver consistent, high-quality outputs across reasoning, understanding, and generation tasks.

    Overall, MMaDA tackles the challenges of building unified multimodal models with a simplified architecture and new training techniques. The research suggests that diffusion models can serve as general-purpose systems capable of reasoning and generation across multiple data types. By addressing the limitations of existing models, it provides a blueprint for future AI systems that integrate different tasks in a single framework.


    Check out the Paper, the Model on Hugging Face, and the GitHub Page. All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation appeared first on MarkTechPost.
