    InternVL 1.5 Advances Multimodal AI with High-Resolution and Bilingual Capabilities in Open-Source Models

    April 30, 2024

    Multimodal large language models (MLLMs) integrate text and visual data processing to enhance how artificial intelligence understands and interacts with the world. This area of research focuses on creating systems that can comprehend and respond to a combination of visual cues and linguistic information, mimicking human-like interactions more closely.

    The challenge often lies in the limited capabilities of open-source models compared to their commercial counterparts. Open-source models frequently exhibit deficiencies in processing complex visual inputs and supporting various languages, which can restrict their practical applications and effectiveness in diverse scenarios.

    Historically, most open-source MLLMs have been trained at fixed resolutions, primarily using datasets limited to the English language. This approach significantly hinders their functionality when encountering high-resolution images or content in other languages, making it difficult for these models to perform well in tasks that require detailed visual understanding or multilingual capabilities.

    Researchers from Shanghai AI Laboratory, SenseTime Research, Tsinghua University, Nanjing University, Fudan University, and The Chinese University of Hong Kong introduce InternVL 1.5, an open-source MLLM designed to close the performance gap between open-source and proprietary commercial models in multimodal understanding. The model incorporates three major improvements:

    Firstly, a strong vision encoder, InternViT-6B, has been optimized through a continuous learning strategy, enhancing its visual understanding capabilities.

    Secondly, a dynamic high-resolution approach allows the model to handle images up to 4K resolution by dynamically adjusting image tiles based on the input’s aspect ratio and resolution. 

    Lastly, a high-quality bilingual dataset has been meticulously assembled, covering common scenes and document images annotated with English and Chinese question-answer pairs. 

    Together, these three improvements boost the model’s performance on OCR and Chinese-language tasks and make InternVL 1.5 competitive across a range of benchmarks and comparative studies. For image handling, InternVL 1.5 divides each input into between 1 and 40 tiles of 448×448 pixels, choosing the tile layout according to the image’s aspect ratio and resolution, which supports inputs up to 4K. Processing images at this finer granularity improves comprehension of detailed scenes and documents. The model’s bilingual strength comes from training on a dataset of English and Chinese question-answer pairs covering a variety of scenes and document types, which improves OCR and text-based tasks in both languages.
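
    The tile-based image handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function names (`candidate_grids`, `choose_grid`, `tile_boxes`) and the tie-breaking rule are assumptions made for the example, and any further preprocessing the real model performs is left out. The 448×448 tile size comes from the article, and the default limit of 40 tiles follows the range reported in the InternVL 1.5 paper.

```python
from typing import List, Tuple

TILE = 448  # tile side length in pixels, as stated in the article


def candidate_grids(max_tiles: int) -> List[Tuple[int, int]]:
    """Return every (cols, rows) grid that uses between 1 and max_tiles tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def choose_grid(width: int, height: int, max_tiles: int = 40) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    Assumption for this sketch: when several grids match the aspect ratio
    equally well, prefer the one whose total pixel area is closest to the
    original image, so small images are not blown up into many tiles.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    area = width * height

    def score(grid: Tuple[int, int]) -> Tuple[float, int]:
        cols, rows = grid
        ratio_gap = abs(ratio - cols / rows)
        area_gap = abs(cols * rows * TILE * TILE - area)
        return (ratio_gap, area_gap)

    return min(candidate_grids(max_tiles), key=score)


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom), row by row, for an image
    that has already been resized to (cols * TILE) x (rows * TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)  # a 4K (UHD) input
    print(f"grid: {cols}x{rows}, tiles: {len(tile_boxes(cols, rows))}")
```

    For a 3840×2160 input, this sketch picks a 7×4 grid (28 tiles), so the image would be resized to 3136×1792 before being cut into 448×448 tiles. An image that is already 448×448 maps to a single tile.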

    Benchmark results bear this out, particularly on OCR-related datasets and bilingual scene understanding. According to the paper, InternVL 1.5 achieves state-of-the-art results in 8 of 18 benchmarks, improving markedly on previous versions and surpassing some proprietary models in specific tests. For example, it reaches 80.6% accuracy on text-based visual question answering (TextVQA) and 90.9% on document-based question answering (DocVQA). On general multimodal benchmarks that assess both visual and textual understanding, it delivers competitive results, often outperforming other open-source models and rivaling commercial ones.

    In conclusion, InternVL 1.5 addresses two major weaknesses of open-source multimodal large language models: processing high-resolution images and supporting more than one language. By combining a strong vision encoder, dynamic resolution adaptation, and a high-quality bilingual dataset, it narrows the performance gap with commercial counterparts. Its results on OCR-related tasks and bilingual scene understanding establish it as a strong open-source competitor to proprietary systems.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


    The post InternVL 1.5 Advances Multimodal AI with High-Resolution and Bilingual Capabilities in Open-Source Models appeared first on MarkTechPost.
