This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Expanding Capabilities Across Text and Vision

Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

The Challenge of Text-Only Forgetting in MLLMs

However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, is attributed to visual tokens inserted into the language sequence diverting the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question answering (Q&A).
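The attention-diversion effect can be illustrated with a toy calculation. The sketch below is not the paper's measurement; it only assumes a single softmax attention row with equal scores, and the helper name `text_attention_share` is hypothetical. It shows how the share of attention left for text tokens shrinks once visual tokens are inserted into the same sequence.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_attention_share(scores, is_visual):
    """Fraction of one query's attention mass that lands on text tokens.

    Hypothetical helper for illustration only.
    """
    weights = softmax(scores)
    return sum(w for w, visual in zip(weights, is_visual) if not visual)

# Text-only sequence: 4 text tokens with equal scores.
text_only = text_attention_share([0.0] * 4, [False] * 4)

# Same 4 text tokens plus 12 visual tokens with equal scores (toy assumption).
mixed = text_attention_share([0.0] * 16, [False] * 4 + [True] * 12)

print(text_only, mixed)
```

In this toy setting the text tokens keep all of the attention mass in the text-only case and only a quarter of it in the mixed case, simply because the softmax normalizes over a longer sequence.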

Limitations of Existing Mitigation Strategies

Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to restore text comprehension entirely. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, a visual learner and a textual learner, to each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism, and the structure resembles “wings” attached to either side of the attention layers. A routing component controls how much weight each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.

Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. In the first stage of training, only visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and their outputs are combined with those of the main model. This ensures that visual attention doesn’t overwhelm textual understanding.
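The description above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each learner is modeled as cross-attention from the hidden states to modality features, every projection is an identity plus a low-rank update (`x + x @ down @ up`), and the router is reduced to one pair of logits per layer, whereas the article says the real router derives its weights from attention weights. All function and parameter names (`learner`, `wings_layer`, `init_params`, `router_logits`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def low_rank_residual(x, down, up):
    """Identity plus a rank-r update: x + x @ down @ up."""
    update = matmul(matmul(x, down), up)
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]

def init_params(d, r, rng):
    """Low-rank (down, up) pairs for the query, key and value projections."""
    def pair():
        down = [[rng.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]
        up = [[0.0] * d for _ in range(r)]  # zero init: projection starts as identity
        return down, up
    return {name: pair() for name in ("q", "k", "v")}

def learner(hidden, feats, params):
    """Cross-attention from hidden states to one modality's features."""
    d = len(hidden[0])
    q = low_rank_residual(hidden, *params["q"])
    k = low_rank_residual(feats, *params["k"])
    v = low_rank_residual(feats, *params["v"])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, v)

def wings_layer(main_out, hidden, visual_feats, text_feats,
                vis_params, txt_params, router_logits):
    """Add router-weighted learner outputs to the main attention output."""
    w_vis, w_txt = softmax(router_logits)  # assumption: one weight pair per layer
    vis = learner(hidden, visual_feats, vis_params)
    txt = learner(hidden, text_feats, txt_params)
    return [[m + w_vis * a + w_txt * b for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(main_out, vis, txt)]

# Toy usage: 3 query positions, 5 visual tokens, 3 text tokens, width 4, rank 2.
rng = random.Random(0)
d, r = 4, 2
hidden = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
visual_feats = [[1.0] * d for _ in range(5)]
text_feats = [[2.0] * d for _ in range(3)]
main_out = [[0.0] * d for _ in range(3)]
out = wings_layer(main_out, hidden, visual_feats, text_feats,
                  init_params(d, r, rng), init_params(d, r, rng), [0.0, 0.0])
```

Because the `up` matrices start at zero, each learner initially behaves like plain cross-attention, and the low-rank factors add only `2 * d * r` parameters per projection instead of `d * d`. Shifting `router_logits` toward one side moves the layer's output toward that learner's contribution.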

WINGS Performance Benchmarks Across Text and Multimodal Tasks

In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, an improvement of 9.70 points over a comparable baseline model. On CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. On reasoning tasks, it gained 11.9 points on RACE-High and 11.12 points on WSC. On the multimodal benchmark MMMU-VAL, WINGS improved by 4.78 points. It also demonstrated robust results on the Interleaved Image-Text (IIT) benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.
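As a quick sanity check on these figures, the baseline scores can be recovered by subtracting each gain from the reported score. The baselines below are implied by that subtraction, not quoted from the article, and only the two benchmarks with both a score and a gain are included.

```python
# (reported WINGS score, reported gain over baseline), as given in the article.
reported = {
    "MMLU": (60.53, 9.70),
    "CMMLU": (69.82, 9.36),
}

# Implied baseline = score - gain (derived here, not stated in the source).
implied_baseline = {
    name: round(score - gain, 2) for name, (score, gain) in reported.items()
}

# Relative gain over the implied baseline, in percent.
relative_gain = {
    name: 100 * gain / implied_baseline[name]
    for name, (_, gain) in reported.items()
}

for name in reported:
    print(f"{name}: baseline {implied_baseline[name]:.2f}, "
          f"relative gain {relative_gain[name]:.1f}%")
```

The subtraction puts the implied baselines at 50.83 for MMLU and 60.46 for CMMLU.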

Conclusion: Toward More Balanced and Generalizable MLLMs

In summary, the researchers tackled the issue of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention-based routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models appeared first on MarkTechPost.
