    Meta AI Released the Perception Language Model (PLM): An Open and Reproducible Vision-Language Model to Tackle Challenging Visual Recognition Tasks

    April 18, 2025

    Despite rapid advances in vision-language modeling, much of the progress in this field has been shaped by models trained on proprietary datasets, often relying on distillation from closed-source systems. This reliance creates barriers to scientific transparency and reproducibility, particularly for tasks involving fine-grained image and video understanding. Benchmark performance may reflect the training data and black-box model capabilities more than architectural or methodological improvements, making it difficult to assess true research progress.

    To address these limitations, Meta AI has introduced the Perception Language Model (PLM), a fully open and reproducible framework for vision-language modeling. PLM is designed to support both image and video inputs and is trained without the use of proprietary model outputs. Instead, it draws from large-scale synthetic data and newly collected human-labeled datasets, enabling a detailed evaluation of model behavior and training dynamics under transparent conditions.

    The PLM framework integrates a vision encoder (Perception Encoder) with LLaMA 3 language decoders of varying sizes—1B, 3B, and 8B parameters. It employs a multi-stage training pipeline: initial warm-up with low-resolution synthetic images, large-scale midtraining on diverse synthetic datasets, and supervised fine-tuning using high-resolution data with precise annotations. This pipeline emphasizes training stability and scalability while maintaining control over data provenance and content.
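    As a rough illustration of this staged recipe, the sketch below encodes the three stages as a plain configuration list. The specific resolutions, dataset labels, and unfreezing schedule are assumptions made for the example, not values taken from the PLM release.

```python
# Illustrative sketch of PLM's staged training schedule as described above.
# Stage names, resolutions, and the unfreezing schedule are assumptions for
# this sketch, not values from the PLM release.

TRAINING_STAGES = [
    {
        "name": "warmup",
        "data": "low_resolution_synthetic_images",
        "image_resolution": 224,                      # assumed warm-up size
        "trainable_modules": ["projector"],           # assumed: encoder/decoder frozen early
    },
    {
        "name": "midtraining",
        "data": "large_scale_synthetic_mix",          # images, charts, documents, videos
        "image_resolution": 448,                      # assumed
        "trainable_modules": ["projector", "decoder"],
    },
    {
        "name": "supervised_finetuning",
        "data": "human_labeled_high_resolution",
        "image_resolution": 896,                      # assumed
        "trainable_modules": ["encoder", "projector", "decoder"],
    },
]

def run_stages(train_one_stage, stages=TRAINING_STAGES):
    """Run each stage in order; `train_one_stage` is a caller-supplied callable."""
    for stage in stages:
        train_one_stage(
            data=stage["data"],
            resolution=stage["image_resolution"],
            trainable_modules=stage["trainable_modules"],
        )
```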

    A key contribution of the work is the release of two large-scale, high-quality video datasets addressing existing gaps in temporal and spatial understanding. The PLM–FGQA dataset comprises 2.4 million question-answer pairs capturing fine-grained details of human actions—such as object manipulation, movement direction, and spatial relations—across diverse video domains. Complementing this is PLM–STC, a dataset of 476,000 spatio-temporal captions linked to segmentation masks that track subjects across time, allowing models to reason about “what,” “where,” and “when” in complex video scenes.
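    A minimal sketch of how records in these two datasets might be represented is shown below; the field names are assumptions chosen for illustration, and the actual schema should be taken from the released dataset cards.

```python
from dataclasses import dataclass
from typing import List

# Illustrative record layouts for the two released datasets.
# Field names are assumptions for this sketch, not the published schema.

@dataclass
class FGQARecord:
    """One of ~2.4M fine-grained question-answer pairs in PLM-FGQA."""
    video_id: str
    question: str        # e.g. about object manipulation or movement direction
    answer: str
    start_time: float    # assumed temporal grounding, in seconds
    end_time: float

@dataclass
class STCRecord:
    """One of ~476K spatio-temporal captions in PLM-STC."""
    video_id: str
    caption: str                 # the "what" of the scene
    mask_paths: List[str]        # per-frame segmentation masks: the "where"
    frame_indices: List[int]     # frames the masks cover: the "when"
```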

    Technically, PLM employs a modular architecture that supports high-resolution image tiling (up to 36 tiles) and multi-frame video input (up to 32 frames). A 2-layer MLP projector connects the visual encoder to the LLM, and both synthetic and human-labeled data are structured to support a wide range of tasks including captioning, visual question answering, and dense region-based reasoning. The synthetic data engine, built entirely using open-source models, generates ~64.7 million samples across natural images, charts, documents, and videos—ensuring diversity while avoiding reliance on proprietary sources.
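    The projector itself is a small component. A minimal PyTorch sketch of a 2-layer MLP bridge between the vision encoder's token embeddings and the language model's embedding space might look like the following, with the hidden dimensions and activation chosen as assumptions rather than taken from the released configuration.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Minimal sketch of a 2-layer MLP projector connecting the Perception
    Encoder's outputs to the LLaMA decoder's embedding space. The dimensions
    and GELU activation are assumptions, not the released configuration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim), where num_tokens
        # comes from up to 36 image tiles or up to 32 video frames.
        return self.proj(visual_tokens)
```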

    Meta AI also introduces PLM–VideoBench, a new benchmark designed to evaluate aspects of video understanding not captured by existing benchmarks. It includes tasks such as fine-grained activity recognition (FGQA), smart-glasses video QA (SGQA), region-based dense captioning (RDCap), and spatio-temporal localization (RTLoc). These tasks require models to engage in temporally grounded and spatially explicit reasoning.
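    To make the benchmark's structure concrete, the hedged sketch below loops over the four tasks with a caller-supplied evaluation callable; the task descriptions mirror the text above, while the harness itself is purely illustrative.

```python
# Hypothetical harness over the PLM-VideoBench tasks listed above. The task
# keys mirror the benchmark's task names; the harness and the caller-supplied
# `run_task` callable are illustrative assumptions.

BENCHMARK_TASKS = {
    "FGQA": "fine-grained activity recognition",
    "SGQA": "smart-glasses (egocentric) video QA",
    "RDCap": "region-based dense captioning",
    "RTLoc": "spatio-temporal localization",
}

def evaluate_all(run_task, tasks=BENCHMARK_TASKS):
    """Call run_task(task_name) for each benchmark task and collect its score
    alongside the task description."""
    results = {}
    for name, description in tasks.items():
        results[name] = {"description": description, "score": run_task(name)}
    return results
```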

    Empirical evaluations show that PLM models, particularly at the 8B parameter scale, perform competitively across 40+ image and video benchmarks. In video captioning, PLM achieves gains of +39.8 CIDEr on average over open baselines. On PLM–VideoBench, the 8B variant closes the gap with human performance in structured tasks such as FGQA and shows improved results in spatio-temporal localization and dense captioning. Notably, all results are obtained without distillation from closed models, underscoring the feasibility of open, transparent VLM development.

    In summary, PLM offers a methodologically rigorous and fully open framework for training and evaluating vision-language models. Its release includes not just models and code, but also the largest curated dataset for fine-grained video understanding and a benchmark suite that targets previously underexplored capabilities. PLM is positioned to serve as a foundation for reproducible research in multimodal AI and a resource for future work on detailed visual reasoning in open settings.


    Here are the Paper, Model, and Code. This article originally appeared on MarkTechPost.
