
    This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

    June 21, 2025

    Multimodal LLMs: Expanding Capabilities Across Text and Vision

    Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

    The Challenge of Text-Only Forgetting in MLLMs

    However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because visual tokens inserted into the language sequence divert the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question-and-answer (Q&A) tasks.
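
    The dilution effect described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's analysis: it assumes a single attention query with identical logits for every token, and the helper name `text_attention_share` is hypothetical. It shows how the share of attention mass landing on text tokens shrinks once visual tokens are inserted into the sequence.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_attention_share(logits, is_visual):
    """Fraction of one query's attention mass that lands on text tokens.

    `is_visual[i]` marks whether token i is a visual token (hypothetical mask).
    """
    weights = softmax(logits)
    return sum(w for w, v in zip(weights, is_visual) if not v)

# Toy case: 4 text tokens alone, then the same 4 text tokens
# with 12 visual tokens inserted in the middle of the sequence.
text_only = text_attention_share([0.0] * 4, [False] * 4)
mixed = text_attention_share(
    [0.0] * 16, [False] * 2 + [True] * 12 + [False] * 2
)
print(text_only, mixed)
```

    With equal logits, the text tokens keep all of the attention in the text-only case but only a quarter of it in the mixed case. Real models have learned, unequal logits, so the actual shift depends on training.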

    Limitations of Existing Mitigation Strategies

    Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to fully restore text comprehension. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

    Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

    Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, a visual learner and a textual learner, to each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism, so the structure resembles “wings” attached to either side of the attention layers. A routing component decides how much weight each learner’s output receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.
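
    The routing idea can be sketched in a few lines. This is an illustrative assumption, not the authors' implementation: the function name `combine_with_wings`, the two-score router, and the softmax weighting are all hypothetical choices made to show how learner outputs could be blended with the main attention output.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_with_wings(main_out, visual_out, textual_out, router_scores):
    """Add router-weighted learner outputs to the main attention output.

    `router_scores` is a hypothetical pair (visual_score, textual_score);
    in a real model it would be predicted from the current token mix.
    """
    w_visual, w_textual = softmax(router_scores)
    return [
        m + w_visual * v + w_textual * t
        for m, v, t in zip(main_out, visual_out, textual_out)
    ]

# Equal router scores split the weight evenly between the two learners.
balanced = combine_with_wings([1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0])

# A strongly visual router score lets the visual learner dominate.
visual_heavy = combine_with_wings([1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [100.0, -100.0])
print(balanced, visual_heavy)
```

    Because the learner outputs are added to the main output rather than replacing it, the base model's behavior is preserved when both learners contribute little.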

    Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

    The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. Training proceeds in two stages. In the first stage, only the visual learners are activated to align image features. In the second stage, the visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility between them. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and its output is combined with that of the main model. The design is intended to keep visual attention from overwhelming textual understanding.
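
    To make the “low-rank” and “residual” parts concrete, here is a minimal sketch of a learner in that spirit. It is not the paper's exact LoRRA formulation: the function `lorra_learner`, the choice to factor only the query and key projections, and the matrix shapes are assumptions made for illustration. The hidden state attends over modality features through rank-r projections, and the attended result is added back to the hidden state.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def low_rank_project(down, up, vec):
    """Rank-r projection: up (d x r) applied to down (r x d) applied to vec."""
    return matvec(up, matvec(down, vec))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lorra_learner(hidden, features, q_down, q_up, k_down, k_up):
    """Hypothetical learner: low-rank attention over features plus a residual."""
    d = len(hidden)
    query = low_rank_project(q_down, q_up, hidden)
    keys = [low_rank_project(k_down, k_up, f) for f in features]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the raw features, one dimension at a time.
    attended = [
        sum(w * f[i] for w, f in zip(weights, features)) for i in range(d)
    ]
    # Residual connection: the learner adds to the hidden state.
    return [h + a for h, a in zip(hidden, attended)]

# Toy setup with d = 4 and rank r = 1 (values chosen by hand).
down = [[1.0, 0.0, 0.0, 0.0]]        # r x d
up = [[1.0], [0.0], [0.0], [0.0]]    # d x r
hidden = [1.0, 0.0, 0.0, 0.0]
features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
out = lorra_learner(hidden, features, down, up, down, up)
print(out)
```

    Each factored projection stores 2·d·r values instead of d·d, which is where the efficiency comes from when r is much smaller than d.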

    WINGS Performance Benchmarks Across Text and Multimodal Tasks

    On the MMLU dataset, WINGS achieved a text-only score of 60.53, an improvement of 9.70 points over a similar baseline model. On CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. In reasoning tasks, it gained 11.9 points on RACE-High and 11.12 points on WSC. On the multimodal benchmark MMMU-VAL, WINGS improved by 4.78 points. It also demonstrated robust results on the Interleaved Image-Text (IIT) benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.
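
    The baseline scores are not stated directly, but they follow from the reported numbers by subtraction. The short sketch below does only that arithmetic for the two benchmarks where both a score and a gain are given; the dictionary layout and the helper name `implied_baseline` are illustrative.

```python
# WINGS scores and gains as reported in the paragraph above.
reported = {
    "MMLU": {"wings": 60.53, "gain": 9.70},
    "CMMLU": {"wings": 69.82, "gain": 9.36},
}

def implied_baseline(wings_score, gain):
    """Baseline score implied by a reported score and its gain."""
    return round(wings_score - gain, 2)

baselines = {
    name: implied_baseline(r["wings"], r["gain"]) for name, r in reported.items()
}
print(baselines)
```

    Under this arithmetic the implied baselines are 50.83 on MMLU and 60.46 on CMMLU.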

    Conclusion: Toward More Balanced and Generalizable MLLMs

    In summary, the researchers tackled catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention-based routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models appeared first on MarkTechPost.
