
    Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities

    May 8, 2025

    LLMs have made significant strides in language tasks such as conversational AI, reasoning, and code generation. However, human communication extends beyond text, often incorporating visual elements to enhance understanding. To create a truly versatile AI, models need to process and generate text and visual information together. Training such unified vision-language models from scratch, whether with autoregressive token prediction or a hybrid of diffusion and language losses, yields strong performance but demands vast computational resources and retraining for each new modality. The alternative, augmenting a pretrained LLM with vision capabilities, is more efficient but often compromises the model's original language performance.

    Current research has focused on three main strategies: merging LLMs with standalone image generation models, training large multimodal models end-to-end, or combining diffusion and autoregressive losses. While these methods achieve state-of-the-art results, they either require retraining large models or degrade the LLM's core capabilities. Even so, augmenting pretrained LLMs with vision components has shown significant promise, particularly for image understanding and generation, though existing methods remain limited in efficiency and flexibility.

    Researchers from UCLA, the University of Wisconsin–Madison, and Adobe Research propose X-Fusion, which adapts pretrained LLMs to multimodal tasks while preserving their language capabilities. X-Fusion uses a dual-tower architecture: the LLM's language weights stay frozen while a vision-specific tower processes visual information. The approach aligns text and vision features at multiple levels, improving performance on image-to-text and text-to-image tasks. Through ablation studies, the researchers emphasize the importance of clean image data during training and show that aligning vision features with pretrained representations accelerates convergence, especially for smaller models.
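
    To make the dual-tower idea concrete, here is a minimal PyTorch-style sketch. It is illustrative only: DualTowerBlock is a hypothetical name, the vision block's internals are an assumption, and the joint attention between the two token streams described in the paper is omitted for brevity.

        import torch
        import torch.nn as nn

        class DualTowerBlock(nn.Module):
            """One layer of a dual-tower design (illustrative sketch):
            a frozen pretrained language block paired with a trainable
            vision block of matching width."""
            def __init__(self, text_block: nn.Module, d_model: int, n_heads: int):
                super().__init__()
                self.text_block = text_block
                for p in self.text_block.parameters():
                    p.requires_grad = False  # language weights stay frozen
                # Trainable vision-tower layer (internals are an assumption;
                # the paper mirrors the LLM's layer structure)
                self.vision_block = nn.TransformerEncoderLayer(
                    d_model=d_model, nhead=n_heads, batch_first=True)

            def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor):
                # Text tokens flow through the frozen tower; image tokens
                # through the trainable tower.
                return self.text_block(text_h), self.vision_block(vision_h)

    The key design property the sketch captures is that only the vision tower contributes trainable parameters, which is what lets the frozen language path retain its original behavior.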

    X-Fusion is a unified framework that adapts pretrained LLMs to vision tasks while retaining their language capabilities. It uses a dual-tower design, freezing the LLM's text weights while introducing a separate vision tower for processing visual information. Images are tokenized with a pretrained encoder, and image and text tokens are optimized jointly. An optional X-Fuse operation merges features from the two towers for additional performance. X-Fusion is trained with both autoregressive and image-denoising losses, and it is evaluated on image generation (text-to-image) and image understanding (image-to-text) tasks.
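
    A hedged sketch of those two ingredients follows: a learnable merge standing in for X-Fuse, and a combined training objective. The merge rule and the weighting knob lam are assumptions for illustration, not the paper's exact formulation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class XFuse(nn.Module):
            """Optional feature-merging step (sketch): a learnable
            weighted sum of per-layer features from the two towers."""
            def __init__(self):
                super().__init__()
                self.alpha = nn.Parameter(torch.tensor(0.5))

            def forward(self, text_feat, vision_feat):
                return self.alpha * text_feat + (1 - self.alpha) * vision_feat

        def xfusion_loss(text_logits, text_targets, noise_pred, noise_true,
                         lam: float = 1.0):
            """Joint objective (sketch): autoregressive next-token loss on
            text plus a diffusion-style denoising loss on image latents;
            `lam` is a hypothetical balancing coefficient."""
            ar = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
            denoise = F.mse_loss(noise_pred, noise_true)
            return ar + lam * denoise

    Because the language weights are frozen, gradients from both loss terms update only the vision tower and the merge parameters.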

    The study evaluates the Dual Tower architecture against alternative transformer variants for multimodal integration, comparing it with Single Tower, Gated Tower, and Dual Projection designs and highlighting its flexibility across image and text tasks. The Dual Tower performs best in both image generation and understanding, improving FID by 23% over the other designs without adding training parameters. The study also examines how noise and data ratios affect performance, finding that clean images improve both understanding and generation. Additionally, aligning vision features with a pretrained encoder such as CLIP boosts performance, especially for smaller models.
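
    One way to read "aligning vision features with a pretrained encoder" is as an auxiliary regression toward frozen CLIP features. The sketch below assumes a cosine-distance form, which the article does not specify; the projection layer is likewise a hypothetical bridge between feature widths.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def alignment_loss(vision_feats: torch.Tensor,
                           clip_feats: torch.Tensor,
                           proj: nn.Linear) -> torch.Tensor:
            """Regularizer (sketch): pull the trainable tower's features
            toward a frozen pretrained encoder's (e.g., CLIP's)
            representations. `proj` maps tower width to CLIP width."""
            v = F.normalize(proj(vision_feats), dim=-1)
            c = F.normalize(clip_feats.detach(), dim=-1)  # CLIP stays frozen
            return (1.0 - (v * c).sum(dim=-1)).mean()     # mean cosine distance

    A term like this would be added to the joint objective with its own small weight; the reported effect is faster convergence, most pronounced for smaller models.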

    In conclusion, X-Fusion adapts pretrained LLMs to multimodal tasks such as image understanding and generation while preserving language capabilities. Its Dual Tower architecture keeps the language weights fixed while a separate trainable vision tower processes visual features. Experiments show that X-Fusion outperforms alternative designs on image-to-text and text-to-image tasks. Key findings include the benefit of understanding-focused training data, the value of reducing noise in image data, and the positive impact of feature alignment, especially for smaller models. The research offers valuable insights for building efficient multimodal models.


    Check out the Paper. Also, don’t forget to follow us on Twitter.


    The post Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities appeared first on MarkTechPost.

