Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Error’d: You Talkin’ to Me?

      September 20, 2025

      The Psychology Of Trust In AI: A Guide To Measuring And Designing For User Confidence

      September 20, 2025

      This week in AI updates: OpenAI Codex updates, Claude integration in Xcode 26, and more (September 19, 2025)

      September 20, 2025

      Report: The major factors driving employee disengagement in 2025

      September 20, 2025

      DistroWatch Weekly, Issue 1140

      September 21, 2025

      Distribution Release: DietPi 9.17

      September 21, 2025

      Development Release: Zorin OS 18 Beta

      September 19, 2025

      Distribution Release: IPFire 2.29 Core 197

      September 19, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      @ts-ignore is almost always the worst option

      September 22, 2025
      Recent

      @ts-ignore is almost always the worst option

      September 22, 2025

      MutativeJS v1.3.0 is out with massive performance gains

      September 22, 2025

      Student Performance Prediction System using Python Machine Learning (ML)

      September 21, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      DistroWatch Weekly, Issue 1140

      September 21, 2025
      Recent

      DistroWatch Weekly, Issue 1140

      September 21, 2025

      Distribution Release: DietPi 9.17

      September 21, 2025

      Hyprland Made Easy: Preconfigured Beautiful Distros

      September 20, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

    This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

    June 21, 2025

    Multimodal LLMs: Expanding Capabilities Across Text and Vision

    Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

    The Challenge of Text-Only Forgetting in MLLMs

    However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because visual tokens inserted into the language sequence divert the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question-and-answer (Q&A) tasks.

    Limitations of Existing Mitigation Strategies

    Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to restore text comprehension entirely. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

    Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

    Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules—visual and textual learners—into each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism. The structure resembles “wings” attached to either side of the attention layers. A routing component controls how much attention each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.

    Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

    The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. In the first stage of training, only visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and their outputs are combined with those of the main model. This ensures that visual attention doesn’t overwhelm textual understanding.

    WINGS Performance Benchmarks Across Text and Multimodal Tasks

    In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, representing an improvement of 9.70 points compared to a similar baseline model. For CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. In reasoning tasks like Race-High, it gained 11.9 points, and in WSC, an improvement of 11.12 points was recorded. In multimodal benchmarks like MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also demonstrated robust results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.

    Conclusion: Toward More Balanced and Generalizable MLLMs

    In summary, the researchers tackled the issue of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners alongside attention routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

    The post This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleDisentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
    Next Article Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, Reduced Repetition, and Stronger Function Calling for AI Integration

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    September 3, 2025
    Machine Learning

    Announcing the new cluster creation experience for Amazon SageMaker HyperPod

    September 3, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Register Now for a Laravel Debugging Workshop by Sentry

    Development

    CVE-2025-32924 – Roninwp Revy SQL Injection Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    KLog – ham radio logger

    Linux

    Rilasciata Clonezilla Live 3.2.2-5 con Kernel Linux 6.12 e Opzioni Ezio Potenziate

    Linux

    Highlights

    Windows 11 24H2’s Settings now bundles FAQs section to tell you more about your system

    May 14, 2025

    FAQs are great—they give you official answers without having to search the web. Microsoft is…

    CVE-2025-50182 – “urllib3 Browser Redirect Handling Vulnerability”

    June 19, 2025

    Cloudreve – self-hosted file management system with multi-cloud support

    August 28, 2025

    Overwatch 2 Stadium Mode — Best Juno Builds: Best items, powers, and gameplay tips

    April 24, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.