
    ProVision: A Scalable Programmatic Approach to Vision-Centric Instruction Data for Multimodal Language Models

    January 11, 2025

    The rise of multimodal applications has highlighted the importance of instruction data in training multimodal language models (MLMs) to handle complex image-based queries effectively. Current practices for generating such data rely on large language models (LLMs) or MLMs which, despite their effectiveness, face several challenges: high costs, licensing restrictions, and susceptibility to hallucinations, i.e., generating inaccurate or unreliable content. In addition, the generation process is often opaque, making outputs difficult to customize or interpret and limiting scalability and reliability. Visual instruction data is crucial for enabling MLMs to respond effectively to user queries about input images, yet existing methods for collecting and generating it remain constrained by these issues.

    Recent advancements in MLMs, such as the LLaVA and InstructBLIP models, have leveraged multimodal data to achieve remarkable results in visual-language tasks. However, despite significant progress, these models often underperform in vision-specific tasks like depth estimation and localization due to the limited availability of instruction data for such tasks. While most synthetic data methods rely on LLMs, MLMs, or diffusion models, programmatic approaches like those used in GQA and AGQA focus primarily on evaluation. Unlike these methods, newer approaches aim to generate adaptable single- and multi-image instruction data for training, addressing the limitations of existing techniques and broadening the scope of multimodal learning.

    Researchers from the University of Washington, Salesforce Research, and the University of Southern California introduced ProVision, a scalable programmatic system that uses scene graphs as symbolic image representations to generate vision-centric instruction data. By combining human-written programs with automatically or manually created scene graphs, ProVision ensures interpretability, accuracy, and scalability while avoiding the hallucinations and licensing constraints common in LLM/MLM-driven methods. The system generates over 10 million data points (ProVision-10M) from Visual Genome and DataComp, covering diverse tasks such as object, attribute, and depth-based queries. This data improves MLM performance, yielding up to 8% gains on benchmarks like CVBench, QBench2, and Mantis-Eval across pretraining and fine-tuning stages.
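    To make the core idea concrete, the following is a minimal, hypothetical sketch of a scene graph as a symbolic image representation together with a human-written generator program that turns it into question-answer pairs. The data layout and function names are assumptions for illustration only, not the authors' actual code.

    ```python
    # Hypothetical sketch: a scene graph (objects, attributes, relations) plus a
    # human-written "generator" that converts its annotations into templated
    # question-answer pairs. Field names are illustrative assumptions.
    from typing import Dict, List


    def attribute_questions(scene_graph: Dict) -> List[Dict]:
        """For each object with annotated attributes, emit one templated QA pair."""
        qa_pairs = []
        for obj in scene_graph["objects"]:
            for attr in obj.get("attributes", []):
                qa_pairs.append({
                    "question": f"What is the {attr['type']} of the {obj['name']}?",
                    "answer": attr["value"],
                })
        return qa_pairs


    # A toy scene graph: objects with attributes and one pairwise relation.
    scene_graph = {
        "objects": [
            {"id": 0, "name": "car", "attributes": [{"type": "color", "value": "red"}]},
            {"id": 1, "name": "dog", "attributes": [{"type": "size", "value": "small"}]},
        ],
        "relations": [{"subject": 1, "predicate": "next to", "object": 0}],
    }

    print(attribute_questions(scene_graph))
    # [{'question': 'What is the color of the car?', 'answer': 'red'},
    #  {'question': 'What is the size of the dog?', 'answer': 'small'}]
    ```

    Because answers are read directly off the annotated graph rather than sampled from a language model, every question-answer pair is grounded by construction, which is what lets a programmatic pipeline of this kind avoid hallucinations.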

    The study introduces a method for generating vision-centric instruction data from augmented scene graphs, i.e., scene graphs enhanced with depth and segmentation labels. For single-image scenarios, 24 generators create diverse question-answer pairs from pre-defined templates, focusing on object attributes, relations, and spatial depth. Multi-image generators enable more advanced reasoning tasks such as comparison and aggregation across scene graphs. The scene graph generation pipeline integrates object detection (YOLO-World), segmentation (SAM-2), attribute detection (fine-tuned CoCa and LLaVA-1.5), relation extraction (Osprey), and depth estimation (Depth Anything V2). The modular framework supports customization, enabling users to create diverse data for visual reasoning and multimodal AI applications.
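    As a rough illustration of how a depth-aware generator might operate over such an augmented scene graph, the sketch below asks which of two objects is closer to the camera by comparing per-object depth estimates. The field names and template wording are assumptions, not the paper's actual generators.

    ```python
    # Hypothetical depth-comparison generator over an augmented scene graph whose
    # objects carry a depth estimate (e.g., from a monocular depth model).
    import random
    from typing import Dict


    def depth_comparison_question(scene_graph: Dict) -> Dict:
        """Pick two objects and ask which one is closer to the camera."""
        a, b = random.sample(scene_graph["objects"], 2)
        closer = a if a["depth"] < b["depth"] else b  # smaller depth = closer
        return {
            "question": f"Which is closer to the camera, the {a['name']} or the {b['name']}?",
            "answer": closer["name"],
        }


    augmented_graph = {
        "objects": [
            {"name": "bicycle", "depth": 2.1},    # meters, from a depth estimator
            {"name": "building", "depth": 18.5},
        ]
    }

    print(depth_comparison_question(augmented_graph))
    # {'question': 'Which is closer to the camera, the bicycle or the building?',
    #  'answer': 'bicycle'}
    ```

    Multi-image generators would follow the same pattern but take several scene graphs at once, producing questions that compare or aggregate annotations across images.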

    The experiments involve synthesizing instruction data to improve model performance. Results show that manually annotated scene graphs outperform model-generated ones, and that both the data format (short answer vs. multiple choice) and the data scale significantly affect outcomes. Incorporating the synthesized data in both the pre-training and fine-tuning stages yields the best results. The ProVision-10M dataset was constructed from Visual Genome's manually annotated scene graphs and from scene graphs generated for high-resolution images, producing over 10 million instruction samples. These were tested in augmentation and replacement settings across various benchmarks, demonstrating that scene graphs, whether manually annotated or automatically generated, are effective for creating useful instruction data.
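    For readers unfamiliar with the two data formats compared above, the sketch below renders the same grounded QA pair once as a short-answer sample and once as a multiple-choice sample. The prompt wording and field names are assumptions for illustration, not the authors' templates.

    ```python
    # Hypothetical rendering of one QA pair in the two formats the experiments compare.
    import random

    qa = {
        "question": "What is the color of the car?",
        "answer": "red",
        "distractors": ["blue", "green", "white"],  # e.g., drawn from other attribute values
    }

    # Short-answer format: the model must produce the answer string directly.
    short_answer = {"prompt": qa["question"], "target": qa["answer"]}

    # Multiple-choice format: the answer is hidden among shuffled options and the
    # target becomes the corresponding option letter.
    options = [qa["answer"]] + qa["distractors"]
    random.shuffle(options)
    letters = "ABCD"
    prompt = qa["question"] + "\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    )
    multiple_choice = {"prompt": prompt, "target": letters[options.index(qa["answer"])]}

    print(short_answer)
    print(multiple_choice)
    ```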


    In conclusion, the ProVision system generates vision-centric instruction data for MLMs using scene graph representations and human-written programs. Applied to Visual Genome and DataComp, it creates ProVision-10M, a dataset of over 10 million instructions that improves MLM performance during both pretraining and instruction tuning. The system uses 24 single-image and 14 multi-image instruction generators, producing diverse queries about objects, attributes, and relationships, and achieves up to 8% performance gains on benchmarks like CVBench and Mantis-Eval. While its limitations include dependence on scene graph quality and on human-written programs, future enhancements may improve automation and scalability using LLMs.


    Check out the Paper and Details. All credit for this research goes to the researchers of this project.


    The post ProVision: A Scalable Programmatic Approach to Vision-Centric Instruction Data for Multimodal Language Models appeared first on MarkTechPost.

