
    Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video

    April 18, 2025

    The Challenge of Designing General-Purpose Vision Encoders

    As AI systems grow increasingly multimodal, the role of visual perception models becomes more complex. Vision encoders are expected not only to recognize objects and scenes, but also to support tasks like captioning, question answering, fine-grained recognition, document parsing, and spatial reasoning across both images and videos. Existing models typically rely on diverse pretraining objectives—contrastive learning for retrieval, captioning for language tasks, and self-supervised methods for spatial understanding. This fragmentation complicates scalability and model deployment, and introduces trade-offs in performance across tasks.

    The key open challenge is designing a unified vision encoder that matches or exceeds task-specific methods, operates robustly in open-world scenarios, and scales efficiently across modalities.

    A Unified Solution: Meta AI’s Perception Encoder

    Meta AI introduces Perception Encoder (PE), a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks. PE departs from the traditional multi-objective pretraining paradigm. Instead, it demonstrates that with a carefully tuned training recipe and appropriate alignment methods, contrastive learning alone can yield highly generalizable visual representations.

    The Perception Encoder operates across three scales—PEcoreB, PEcoreL, and PEcoreG—with the largest (G-scale) model containing 2B parameters. These models are designed to function as general-purpose encoders for both image and video inputs, offering strong performance in classification, retrieval, and multimodal reasoning.
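The single training signal behind PE is a standard CLIP-style symmetric contrastive (InfoNCE) objective: matched image-text pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. As a rough illustration of the idea, not Meta's implementation, the following pure-Python sketch computes the loss on toy 2-dimensional embeddings (the vectors and temperature are made up for demonstration):

```python
import math

def normalize(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, text) pairs.

    Row i of img_embs is assumed to match row i of txt_embs.
    """
    img = [normalize(v) for v in img_embs]
    txt = [normalize(v) for v in txt_embs]
    n = len(img)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[dot(img[i], txt[j]) / temperature for j in range(n)]
              for i in range(n)]

    def ce(row, target):
        # Cross-entropy of one softmax row against its matching index.
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        return -math.log(exps[target] / sum(exps))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = sum(ce(logits[i], i) for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(ce(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Correctly aligned toy pairs score a lower loss than swapped captions.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts_good = [[1.0, 0.1], [0.1, 1.0]]
txts_bad = [[0.1, 1.0], [1.0, 0.1]]  # captions swapped
assert contrastive_loss(imgs, txts_good) < contrastive_loss(imgs, txts_bad)
```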

    Training Approach and Architecture

    The pretraining of PE follows a two-stage process. The first stage involves robust contrastive learning on a large-scale curated image-text dataset (5.4B pairs), where several architectural and training enhancements improve both accuracy and robustness. These include progressive resolution scaling, large batch sizes (up to 131K), use of the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.
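Of these ingredients, 2D RoPE is the easiest to illustrate in isolation: each patch's features are rotated by angles derived from its (x, y) grid position, so that dot products between patches depend only on their relative offset. A simplified pure-Python sketch; the half-and-half dimension split and base frequency follow common RoPE conventions, not necessarily PE's exact configuration:

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotary encoding: rotate consecutive dim pairs by pos-dependent angles."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, x, y):
    """2D RoPE for a patch at grid position (x, y): the first half of the
    channels is rotated according to x, the second half according to y."""
    h = len(vec) // 2
    return rope_1d(vec[:h], x) + rope_1d(vec[h:], y)

v = [0.3, -1.2, 0.7, 0.5]
rotated = rope_2d(v, x=3, y=5)
# Rotations preserve the vector's L2 norm.
norm = lambda u: math.sqrt(sum(t * t for t in u))
assert abs(norm(v) - norm(rotated)) < 1e-9
```

The useful property, preserved by this sketch, is that the inner product between two rotated vectors depends only on the difference of their positions, which is what makes RoPE friendly to resolution changes.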

    The second stage introduces video understanding by leveraging a video data engine that synthesizes high-quality video-text pairs. This pipeline incorporates captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, which are then summarized using Llama 3.3. These synthetic annotations allow the same image encoder to be fine-tuned for video tasks via frame averaging.
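Frame averaging here simply means pooling per-frame image embeddings into a single video embedding. A minimal sketch, with a toy `encode_frame` standing in for the PE image encoder:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def video_embedding(frames, encode_frame):
    """Embed a video by averaging per-frame embeddings, then re-normalizing.
    `encode_frame` is a stand-in for the PE image encoder."""
    embs = [l2_normalize(encode_frame(f)) for f in frames]
    d = len(embs[0])
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(d)]
    return l2_normalize(mean)

# Toy stand-in encoder: identity on 2-d "frames".
frames = [[1.0, 0.0], [0.8, 0.6]]
v = video_embedding(frames, lambda f: f)
assert abs(math.sqrt(sum(x * x for x in v)) - 1.0) < 1e-9
```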

    Despite using a single contrastive objective, PE features general-purpose representations distributed across intermediate layers. To access these, Meta introduces two alignment strategies:

    • Language alignment for tasks such as visual question answering and captioning.
    • Spatial alignment for detection, tracking, and depth estimation, using self-distillation and spatial correspondence distillation via SAM2.
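The details of these alignment stages are involved, but the core mechanic of feature distillation can be sketched as matching student patch features to a teacher's. The MSE loss below is a deliberately simplified stand-in for the self-distillation and SAM2-based correspondence distillation described above:

```python
def spatial_distill_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher per-patch features.
    A simplified stand-in for spatial-correspondence distillation; the
    real recipe (self-distillation plus SAM2 correspondences) is richer.
    Each argument is a list of per-patch feature vectors."""
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count

# Toy 2-patch, 2-dim features.
student = [[0.2, 0.4], [0.9, 0.1]]
teacher = [[0.0, 0.5], [1.0, 0.0]]
assert spatial_distill_loss(student, student) == 0.0
assert spatial_distill_loss(student, teacher) > 0.0
```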

    Empirical Performance Across Modalities

    PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks. On image classification, PEcoreG matches or exceeds proprietary models trained on large private datasets such as JFT-3B. It achieves:

    • 86.6% on ImageNet-val
    • 92.6% on ImageNet-Adversarial
    • 88.2% on the full ObjectNet set
    • Competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers
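Zero-shot classification with a contrastive encoder reduces to nearest-neighbor search in the shared embedding space: embed a text prompt per class (e.g. "a photo of a dog") with the text tower, then pick the class whose prompt embedding has the highest cosine similarity to the image embedding. A toy sketch with made-up embeddings standing in for PE outputs:

```python
import math

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text-prompt embedding is most cosine-similar
    to the image embedding."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    img = norm(image_emb)
    best, best_sim = None, -2.0
    for name, emb in class_text_embs.items():
        sim = sum(a * b for a, b in zip(img, norm(emb)))
        if sim > best_sim:
            best, best_sim = name, sim
    return best

# Made-up prompt embeddings for two classes.
classes = {"dog": [0.9, 0.1], "cat": [0.1, 0.9]}
assert zero_shot_classify([0.8, 0.2], classes) == "dog"
assert zero_shot_classify([0.2, 0.8], classes) == "cat"
```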

    In video tasks, PE achieves state-of-the-art performance on zero-shot classification and retrieval benchmarks, outperforming InternVideo2 and SigLIP2-g-opt, while being trained on just 22M synthetic video-caption pairs. The use of simple average pooling across frames—rather than temporal attention—demonstrates that architectural simplicity, when paired with well-aligned training data, can still yield high-quality video representations.

    An ablation study shows that each component of the video data engine contributes meaningfully to performance. Improvements of +3.9% in classification and +11.1% in retrieval over image-only baselines highlight the utility of synthetic video data, even at modest scale.

    Conclusion

    Perception Encoder provides a technically compelling demonstration that a single contrastive objective, if implemented with care and paired with thoughtful alignment strategies, is sufficient to build general-purpose vision encoders. PE not only matches specialized models in their respective domains but does so with a unified and scalable approach.

    The release of PE, along with its codebase and the PE Video Dataset, offers the research community a reproducible and efficient foundation for building multimodal AI systems. As visual reasoning tasks grow in complexity and scope, PE provides a path forward toward more integrated and robust visual understanding.


    Check out the Paper, Model, Code and Dataset.


    The post Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video appeared first on MarkTechPost.

