    Researchers from Meta AI and UT Austin Explored Scaling in Auto-Encoders and Introduced ViTok: A ViT-Style Auto-Encoder to Perform Exploration

    January 18, 2025

    Modern image and video generation methods rely heavily on tokenization to encode high-dimensional data into compact latent representations. While advancements in scaling generator models have been substantial, tokenizers—primarily based on convolutional neural networks (CNNs)—have received comparatively less attention. This raises questions about how scaling tokenizers might improve reconstruction accuracy and generative tasks. Challenges include architectural limitations and constrained datasets, which affect scalability and broader applicability. There is also a need to understand how design choices in auto-encoders influence performance metrics such as fidelity, compression, and generation.
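    To make the compression concrete, here is a back-of-the-envelope sketch of what tokenization buys. The patch size and per-patch channel count below are hypothetical, chosen only for illustration:

    ```python
    # Illustrative arithmetic only; patch size and latent channels are
    # hypothetical, not the settings of any particular tokenizer.
    H, W, C = 256, 256, 3      # input image dimensions
    patch = 16                 # side length of each square patch
    latent_ch = 16             # floats stored per patch in the latent code

    num_patches = (H // patch) * (W // patch)   # 16 * 16 = 256 tokens
    E = num_patches * latent_ch                 # total floats in the latent code
    pixels = H * W * C                          # raw input values

    print(f"latent size E = {E}, input size = {pixels}, "
          f"compression = {pixels / E:.0f}x")
    # latent size E = 4096, input size = 196608, compression = 48x
    ```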

    Researchers from Meta and UT Austin have addressed these issues by introducing ViTok, a Vision Transformer (ViT)-based auto-encoder. Unlike traditional CNN-based tokenizers, ViTok employs a Transformer-based architecture enhanced by the Llama framework. This design supports large-scale tokenization for images and videos, overcoming dataset constraints by training on extensive and diverse data.

    ViTok focuses on three aspects of scaling:

    1. Bottleneck scaling: Examining the relationship between latent code size and performance.
    2. Encoder scaling: Evaluating the impact of increasing encoder complexity.
    3. Decoder scaling: Assessing how larger decoders influence reconstruction and generation.

    These efforts aim to optimize visual tokenization for both images and videos by addressing inefficiencies in existing architectures.
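    One way to picture this experimental design is as a grid over the three axes. The sketch below uses placeholder sizes, not the paper's actual configurations, purely to show how such a sweep might be organized:

    ```python
    from itertools import product

    # Hypothetical sweep over the three scaling axes; all sizes are
    # placeholders for illustration, not ViTok's reported configurations.
    bottleneck_sizes = [1024, 4096, 16384]   # E: total floats in the latent code
    encoder_widths = [384, 768, 1152]        # hidden width of the ViT encoder
    decoder_widths = [384, 768, 1152]        # hidden width of the ViT decoder

    for E, enc_w, dec_w in product(bottleneck_sizes, encoder_widths, decoder_widths):
        cfg = {"E": E, "encoder_width": enc_w, "decoder_width": dec_w}
        # In a real study, each cfg would be trained and scored on
        # reconstruction (PSNR/SSIM) and generation (FID) metrics.
        print(cfg)
    ```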

    Technical Details and Advantages of ViTok

    ViTok uses an asymmetric auto-encoder framework with several distinctive features:

    1. Patch and Tubelet Embedding: Inputs are divided into patches (for images) or tubelets (for videos) to capture spatial and spatiotemporal details.
    2. Latent Bottleneck: The size of the latent space, defined by the total number of floating-point values (E) in the latent code, determines the balance between compression and reconstruction quality.
    3. Encoder and Decoder Design: ViTok employs a lightweight encoder for efficiency and a more computationally intensive decoder for robust reconstruction, as sketched below.
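    A minimal PyTorch-style sketch of this asymmetric layout follows. All dimensions, depths, and head counts are illustrative assumptions, not ViTok's published configuration, and positional embeddings and the video (tubelet) path are omitted for brevity:

    ```python
    import torch
    import torch.nn as nn

    class AsymmetricViTAutoEncoder(nn.Module):
        """Illustrative sketch of a ViT-style asymmetric auto-encoder."""

        def __init__(self, img_size=256, patch=16, latent_ch=16,
                     enc_dim=384, enc_depth=4, dec_dim=768, dec_depth=12):
            super().__init__()
            self.patch = patch

            # Patch embedding: non-overlapping patches -> token sequence
            self.embed = nn.Conv2d(3, enc_dim, kernel_size=patch, stride=patch)

            # Lightweight encoder (few, narrow layers)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(enc_dim, nhead=6, batch_first=True),
                num_layers=enc_depth)

            # Bottleneck: each token is squeezed to latent_ch floats,
            # so E = num_patches * latent_ch per image.
            self.to_latent = nn.Linear(enc_dim, latent_ch)

            # Heavier decoder (wider and deeper) for robust reconstruction
            self.from_latent = nn.Linear(latent_ch, dec_dim)
            self.decoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dec_dim, nhead=12, batch_first=True),
                num_layers=dec_depth)
            self.to_pixels = nn.Linear(dec_dim, patch * patch * 3)

        def forward(self, x):
            B, _, H, W = x.shape
            tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, enc_dim)
            z = self.to_latent(self.encoder(tokens))           # (B, N, latent_ch)
            h = self.decoder(self.from_latent(z))              # (B, N, dec_dim)
            out = self.to_pixels(h)                            # (B, N, p*p*3)
            # Un-patchify the token outputs back into an image grid
            p, g = self.patch, H // self.patch
            out = out.view(B, g, g, p, p, 3).permute(0, 5, 1, 3, 2, 4)
            return out.reshape(B, 3, H, W), z
    ```

    With these placeholder defaults, the decoder is roughly twice the encoder's width and three times its depth, mirroring the lightweight-encoder/heavy-decoder split described above.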

    By leveraging Vision Transformers, ViTok improves scalability. Its enhanced decoder incorporates perceptual and adversarial losses to produce high-quality outputs (a sketch of such a composite loss follows the list below). Together, these components enable ViTok to:

    • Achieve effective reconstruction with fewer FLOPs.
    • Handle image and video data efficiently, exploiting the redundancy in video sequences.
    • Balance trade-offs between fidelity (e.g., PSNR, SSIM) and perceptual quality (e.g., FID, IS).
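    As a rough illustration of how such a composite objective can be assembled (the weights and the perceptual/discriminator modules here are assumptions, not the paper's exact recipe):

    ```python
    import torch.nn.functional as F

    def reconstruction_loss(recon, target, perceptual_net, discriminator,
                            w_pix=1.0, w_perc=0.5, w_adv=0.1):
        """Composite loss: pixel fidelity + perceptual + adversarial terms.

        perceptual_net and discriminator stand in for pretrained modules
        (e.g. an LPIPS-style feature network and a patch discriminator);
        the weights are illustrative, not ViTok's published values.
        """
        pixel = F.l1_loss(recon, target)                    # raw fidelity
        perceptual = perceptual_net(recon, target).mean()   # feature-space distance
        adversarial = -discriminator(recon).mean()          # encourage realism
        return w_pix * pixel + w_perc * perceptual + w_adv * adversarial
    ```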

    Results and Insights

    ViTok’s performance was evaluated on benchmarks such as ImageNet-1K and COCO for images, and UCF-101 for videos. Key findings include:

    • Bottleneck Scaling: Increasing bottleneck size improves reconstruction but can complicate generative tasks if the latent space is too large.
    • Encoder Scaling: Larger encoders show minimal benefits for reconstruction and may hinder generative performance due to increased decoding complexity.
    • Decoder Scaling: Larger decoders enhance reconstruction quality, but their benefits for generative tasks vary. A balanced design is often required.

    Results highlight ViTok’s strengths in efficiency and accuracy:

    • State-of-the-art metrics for image reconstruction at 256p and 512p resolutions.
    • Improved video reconstruction scores, demonstrating adaptability to spatiotemporal data.
    • Competitive generative performance in class-conditional tasks with reduced computational demands.

    Conclusion

    ViTok offers a scalable, Transformer-based alternative to traditional CNN tokenizers, addressing key challenges in bottleneck design, encoder scaling, and decoder optimization. Its robust performance across reconstruction and generation tasks highlights its potential for a wide range of applications. By effectively handling both image and video data, ViTok underscores the importance of thoughtful architectural design in advancing visual tokenization.


    Check out the Paper. All credit for this research goes to the researchers of this project. This article originally appeared on MarkTechPost.
