
BRIDGETOWER: A Novel Transformer-based Vision-Language (VL) Model that Takes Full Advantage of the Features of Different Layers in Pre-Trained Uni-Modal Encoders

    June 3, 2024

    Vision-and-language (VL) representation learning is an evolving field focused on integrating visual and textual information to enhance machine learning models’ performance across a variety of tasks. This integration enables models to understand and process images and text simultaneously, improving outcomes such as image captioning, visual question answering (VQA), and image-text retrieval.

    A significant challenge in VL representation learning is effectively aligning and fusing information from visual and textual modalities. Traditional methods often process visual and textual data separately before combining them, which can result in incomplete or suboptimal interactions between the modalities. This limitation hinders the ability of models to fully utilize the rich semantic information present in both visual and textual data, thereby affecting their performance and adaptability to different tasks.

Existing work relies on uni-modal encoders that process visual and textual data separately before combining them, often leading to incomplete cross-modal interactions. Models like METER and ALBEF follow this approach but struggle to fully exploit the semantic richness across modalities. ALIGN and similar frameworks integrate visual and textual data only at later stages, which can hinder comprehensive alignment and fusion of information. While effective to some extent, these methods fall short of optimal performance because they handle visual and textual representations separately.

    Researchers from Microsoft and Google have introduced BRIDGETOWER, a novel transformer-based model designed to improve cross-modal alignment and fusion. BRIDGETOWER incorporates multiple bridge layers that connect the top layers of uni-modal encoders with each layer of the cross-modal encoder. This innovative design enables more effective bottom-up alignment of visual and textual representations, enhancing the model’s ability to combine these data types seamlessly.

BRIDGETOWER employs bridge layers to integrate visual and textual information at different semantic levels. Each bridge layer uses a LayerNorm operation to merge the output of a uni-modal encoder layer with the hidden states of the corresponding cross-modal encoder layer, allowing for more nuanced and detailed interactions across the model’s layers. Because the method builds on pre-trained uni-modal encoders and only adds the bridge layers to connect them with the cross-modal encoder, it enables a bottom-up alignment and fusion of visual and textual representations at different semantic levels, yielding a more effective and informative cross-modal interaction at each layer of the cross-modal encoder.
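    To make the mechanism concrete, the following is a minimal PyTorch-style sketch of how such bridge layers could merge the top uni-modal layer outputs into successive cross-modal encoder layers through addition and LayerNorm. All class names and arguments are illustrative assumptions for this sketch, and the single-stream fusion layer is a simplification of the paper's two-stream cross-modal block, not the authors' implementation.

    ```python
    import torch
    import torch.nn as nn


    class BridgeLayer(nn.Module):
        """Merge one uni-modal layer output into the cross-modal stream.

        Sketch only: element-wise addition followed by LayerNorm, as described above.
        """

        def __init__(self, hidden_size: int):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_size)

        def forward(self, cross_states: torch.Tensor, uni_states: torch.Tensor) -> torch.Tensor:
            return self.norm(cross_states + uni_states)


    class CrossModalEncoderWithBridges(nn.Module):
        """The k-th cross-modal layer receives the k-th of the top-K uni-modal
        layer outputs through its own bridge layer (bottom-up fusion).

        For brevity, fusion here is one self-attention layer over the concatenated
        text and visual tokens; the paper's block is a two-stream design with
        self- and cross-attention.
        """

        def __init__(self, hidden_size: int, num_layers: int, num_heads: int = 8):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
                for _ in range(num_layers)
            )
            self.text_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
            self.visual_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))

        def forward(self, text_states, visual_states, top_text_layers, top_visual_layers):
            # top_text_layers / top_visual_layers: outputs of the top-K layers of
            # the pre-trained uni-modal encoders, ordered bottom-up.
            for layer, t_bridge, v_bridge, t_uni, v_uni in zip(
                self.layers, self.text_bridges, self.visual_bridges,
                top_text_layers, top_visual_layers,
            ):
                text_states = t_bridge(text_states, t_uni)
                visual_states = v_bridge(visual_states, v_uni)
                fused = layer(torch.cat([text_states, visual_states], dim=1))
                text_states, visual_states = fused.split(
                    [text_states.size(1), visual_states.size(1)], dim=1
                )
            return text_states, visual_states
    ```

    Because each bridge adds only a residual connection and a LayerNorm, the extra parameter count stays small, which is consistent with the negligible overhead reported in the results below.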

The performance of BRIDGETOWER has been evaluated extensively across various vision-language tasks, with strong results. On the MSCOCO dataset, BRIDGETOWER achieved an RSUM of 498.9%, outperforming the previous state-of-the-art model, METER, by 2.8%. For image retrieval, it scored 62.4% IR@1, surpassing METER by 5.3%. It also outperformed the ALIGN and ALBEF models, which were pre-trained on much larger datasets. For text retrieval, it achieved 75.0% TR@1, slightly below METER by 1.2%. On the VQAv2 test-std set, BRIDGETOWER attained an accuracy of 78.73%, outperforming METER by 1.09% with the same pre-training data and nearly negligible additional parameters and computational cost. When scaled further, the model reached 81.15% accuracy on the VQAv2 test-std set, surpassing models pre-trained on significantly larger datasets.

    In conclusion, the research introduces BRIDGETOWER, a novel model designed to enhance vision and language tasks by integrating multiple bridge layers that connect uni-modal and cross-modal encoders. By enabling effective alignment and fusion of visual and textual data, BRIDGETOWER outperforms existing models like METER in various tasks such as image retrieval and visual question answering. The model’s ability to achieve state-of-the-art performance with minimal additional computational cost demonstrates its potential for advancing the field. This work underscores the importance of efficient cross-modal interactions for improving the accuracy and scalability of vision-and-language models.
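    For readers who want to try the model, BridgeTower checkpoints are published on the Hugging Face Hub and supported in the Transformers library; the sketch below scores image-caption pairs with the image-text matching head. The class names and checkpoint identifier are assumptions based on the public Transformers integration, not details from the article above.

    ```python
    import requests
    from PIL import Image
    from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

    # Checkpoint name assumed from the Hugging Face Hub.
    ckpt = "BridgeTower/bridgetower-base-itm-mlm"
    processor = BridgeTowerProcessor.from_pretrained(ckpt)
    model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    captions = ["Two cats sleeping on a couch", "A football player scoring a goal"]

    # Score each caption against the image with the image-text matching head.
    for caption in captions:
        inputs = processor(image, caption, return_tensors="pt")
        logits = model(**inputs).logits        # shape [1, 2]: (no-match, match)
        print(caption, logits[0, 1].item())
    ```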

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Source: MarkTechPost