    ChatRex: A Multimodal Large Language Model (MLLM) with a Decoupled Perception Design

    December 1, 2024

    Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual understanding. However, they face significant challenges in fine-grained perception tasks such as object detection, which is critical for applications like autonomous driving and robotic navigation. Current models fail to achieve precise detection, as reflected in the low recall rates of even state-of-the-art systems like Qwen2-VL, which manages a recall of only 43.9% on the COCO dataset. This gap stems from an inherent conflict between perception and understanding tasks and from the lack of datasets that balance the two.

    Traditional approaches to adding perception to MLLMs usually tokenize bounding-box coordinates so that they fit the output format of auto-regressive models. Although this keeps detection compatible with understanding tasks, it suffers from cascading errors, ambiguous object prediction order, and quantization inaccuracies in complex images. Retrieval-based perception frameworks, as in Groma and Shikra, change how objects are detected but are not robust enough across diverse real-world tasks. These limitations are compounded by insufficient training datasets, which fail to address the twin requirements of perception and understanding.
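To make the quantization problem concrete, here is a minimal sketch of what happens when pixel coordinates are mapped to a fixed vocabulary of coordinate tokens and back. The bin count of 1000, the image size, and the function names are assumptions chosen for illustration; the article does not say how any particular model discretizes coordinates.

```python
def quantize(value: float, size: float, num_bins: int) -> int:
    """Map a pixel coordinate to one of `num_bins` discrete coordinate tokens."""
    bin_index = int(value * num_bins / size)
    return min(bin_index, num_bins - 1)  # a coordinate equal to `size` goes in the last bin


def dequantize(bin_index: int, size: float, num_bins: int) -> float:
    """Recover a pixel coordinate from a token, using the centre of its bin."""
    return (bin_index + 0.5) * size / num_bins


def roundtrip_box(box, width, height, num_bins):
    """Quantize an (x0, y0, x1, y1) box and decode it again."""
    sizes = (width, height, width, height)
    tokens = [quantize(v, s, num_bins) for v, s in zip(box, sizes)]
    restored = [dequantize(t, s, num_bins) for t, s in zip(tokens, sizes)]
    return tokens, restored


# Hypothetical example: a small object in a 4000 x 3000 image, 1000 bins per axis.
box = (1233.0, 567.0, 1251.0, 581.0)
tokens, restored = roundtrip_box(box, width=4000, height=3000, num_bins=1000)
errors = [abs(a - b) for a, b in zip(box, restored)]
print(tokens, restored, errors)
```

With these assumed settings each bin spans 4 pixels horizontally and 3 pixels vertically, so a decoded coordinate can be off by up to half a bin. For an object only 18 pixels wide, that is a noticeable share of its extent, and the error exists before the model makes any prediction mistake of its own.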

    To overcome this challenge, researchers from the International Digital Economy Academy (IDEA) developed ChatRex, an MLLM with a decoupled architecture that strictly separates perception from understanding. ChatRex is built on a retrieval-based framework in which object detection is treated as retrieving the indices of bounding boxes rather than predicting coordinates directly. This formulation removes quantization errors and improves detection accuracy. A Universal Proposal Network (UPN) generates both fine-grained and coarse-grained bounding box proposals, which addresses ambiguities in object representation. The architecture also includes a dual-vision encoder that combines high-resolution and low-resolution visual features to make object tokenization more precise. Training relies on the newly developed Rexverse-2M dataset, a large collection of images with multi-granular annotations that supports balanced training across perception and understanding tasks.
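The idea of detection as index retrieval can be sketched in a few lines. In this illustration a proposal stage has already produced candidate boxes, and the language model only has to name which candidates it means. The `<objN>` token format and the `resolve_boxes` helper are hypothetical; the article does not specify the tokens ChatRex actually emits.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

# Hypothetical index-token format, assumed for illustration only.
INDEX_TOKEN = re.compile(r"<obj(\d+)>")


def resolve_boxes(model_output: str, proposals: List[Box]) -> List[Box]:
    """Return the proposal boxes referenced by index tokens in the model output."""
    boxes = []
    for match in INDEX_TOKEN.finditer(model_output):
        index = int(match.group(1))
        if 0 <= index < len(proposals):  # skip indices with no matching proposal
            boxes.append(proposals[index])
    return boxes


# Candidate boxes as they might come from a proposal network (made-up values).
proposals = [
    (10.5, 20.25, 110.5, 220.25),
    (0.0, 0.0, 50.0, 50.0),
    (300.0, 40.0, 380.0, 90.0),
]
print(resolve_boxes("Two dogs: <obj0><obj2>", proposals))
```

Because the model outputs an index rather than four numbers, the returned coordinates are exactly those of the proposal. No coordinate is discretized on the language-model side, so detection quality depends on the proposals and on picking the right index.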

    The Universal Proposal Network is based on DETR. It generates robust bounding box proposals at multiple levels of granularity, which mitigates inconsistencies in object labeling across datasets. By using fine-grained and coarse-grained prompts during training, the UPN can detect objects accurately in different scenarios. The dual-vision encoder keeps the visual encoding compact and efficient by combining high-resolution image features with low-resolution representations. The training dataset, Rexverse-2M, contains more than two million annotated images with region descriptions, bounding boxes, and captions, which balances perception with understanding and contextual analysis in ChatRex's training.

    ChatRex performs strongly on both perception and understanding benchmarks and is reported to surpass existing models. In object detection, it achieves higher precision, recall, and mean Average Precision (mAP) than competitors on datasets including COCO and LVIS. In referring object detection, it accurately associates descriptive expressions with the corresponding objects, which shows its ability to handle complex interactions between textual and visual inputs. The system also excels at generating grounded image captions, answering region-specific queries, and handling object-aware conversations. This success stems from its decoupled architecture, retrieval-based detection strategy, and the broad training enabled by the Rexverse-2M dataset.
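For readers unfamiliar with the detection metrics named above, the following is a minimal sketch of precision and recall for a single image. It uses greedy one-to-one matching and an IoU threshold of 0.5, both of which are assumptions for illustration; the article does not state the evaluation protocol, and full mAP additionally ranks predictions by confidence and averages over classes and thresholds.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box can be matched once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up boxes: one prediction hits a ground-truth object, one misses.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(precision_recall(predictions, ground_truth))
```

Recall is the share of ground-truth objects that were found, so a low recall means many objects in the image were never detected at all. Precision is the share of predictions that correspond to a real object.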

    ChatRex is presented as a multimodal model that resolves the long-standing conflict between perception and understanding tasks. Its design, combined with a large and balanced training dataset, allows for both precise object detection and context-rich understanding. These dual capabilities open up applications in dynamic and complex environments and illustrate how integrating perception and understanding can extend what multimodal systems are able to do.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post ChatRex: A Multimodal Large Language Model (MLLM) with a Decoupled Perception Design appeared first on MarkTechPost.
