    Content-Adaptive Tokenizer (CAT): An Image Tokenizer that Adapts Token Count based on Image Complexity, Offering Flexible 8x, 16x, or 32x Compression

    January 10, 2025

    One of the major hurdles in AI-driven image modeling is accounting for the wide variation in image content complexity. Existing tokenization methods apply a static compression ratio, treating every image the same regardless of how complex it is. As a result, complex images are over-compressed and lose crucial information, while simple images are under-compressed and waste computational resources. These inefficiencies hurt downstream tasks such as image reconstruction and generation, where accurate and efficient representation is critical.

    Current image tokenization techniques do not handle this variation well. Fixed-ratio approaches resize images to standard sizes without considering how complex the content is. Vision Transformer approaches that adapt patch size dynamically rely on the image as input, so they do not carry over to text-to-image applications. Classical compression schemes such as JPEG are designed for traditional media and are not optimized for deep learning-based tokenization. A recent method, ElasticTok, uses random token lengths but does not account for intrinsic content complexity at training time, which costs both quality and compute.

    Researchers from Carnegie Mellon University and Meta propose Content-Adaptive Tokenization (CAT), a framework for content-aware image tokenization that allocates representation capacity according to content complexity. CAT uses large language models to assess image complexity from captions and perception-based queries, then assigns each image to one of three compression levels: 8x, 16x, or 32x. It pairs this with a nested VAE architecture that produces variable-length latent features by dynamically routing intermediate outputs based on image complexity. The adaptive design reduces training overhead and improves image representation quality relative to fixed-ratio methods. Because the complexity analysis is text-based, CAT can choose a compression level without requiring image inputs at inference.
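
To make the three compression levels concrete, here is a minimal sketch of how a complexity score could be mapped to a ratio and what that means for token count. It is an illustration, not the authors' implementation: the 1-to-9 score scale, the two thresholds, and the function names are all hypothetical, and the sketch assumes each ratio is a per-side spatial downsampling factor, as is common for latent VAEs.

```python
# Hypothetical sketch: the score scale, thresholds, and function names are
# illustrative assumptions, not values taken from the CAT paper.

# Compression ratios named in the article, assumed to be per-side
# spatial downsampling factors.
COMPRESSION_LEVELS = (8, 16, 32)


def choose_compression(complexity_score: float,
                       low_threshold: float = 3.0,
                       high_threshold: float = 6.0) -> int:
    """Map a complexity score to a ratio: more complex means less compression."""
    if complexity_score >= high_threshold:
        return 8    # complex image: keep the most latent positions
    if complexity_score >= low_threshold:
        return 16   # medium complexity
    return 32       # simple image: compress the most


def token_count(height: int, width: int, ratio: int) -> int:
    """Number of latent positions for an image at a given compression ratio."""
    if ratio not in COMPRESSION_LEVELS:
        raise ValueError(f"unsupported ratio: {ratio}")
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


if __name__ == "__main__":
    for score in (8.0, 4.0, 1.0):
        ratio = choose_compression(score)
        print(score, ratio, token_count(512, 512, ratio))
```

Under these assumptions, a 512x512 image yields 4096 latent positions at 8x, 1024 at 16x, and 256 at 32x, which is why routing simple images to 32x saves so much compute.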

    CAT evaluates complexity through LLM-based analysis of captions, weighing semantic, visual, and perceptual features when choosing a compression ratio. This caption-based scoring is reported to track human-perceived importance better than traditional measures such as JPEG file size and MSE. The adaptive nested VAE uses channel-matched skip connections to adjust the latent space across compression levels, and shared parameterization keeps the representations consistent across scales. Training combines reconstruction error, a perceptual loss (for example, LPIPS), and an adversarial loss. CAT was trained on a dataset of 380 million images and tested on the COCO, ImageNet, CelebA, and ChartQA benchmarks, showing that it applies to a range of image types.
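
The three-part training objective described above can be sketched as a weighted sum. This is only an illustration of the general pattern: the weights are hypothetical placeholders, not values from the paper, and the perceptual and adversarial terms are passed in as precomputed numbers rather than computed by real LPIPS or discriminator networks.

```python
# Hypothetical sketch of a combined tokenizer objective. The weights below
# are placeholder assumptions; the article does not state the actual values.
from typing import Sequence


def mse(original: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Mean squared error between two equal-length pixel sequences."""
    if len(original) != len(reconstruction) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - b) ** 2 for a, b in zip(original, reconstruction)]
    return sum(squared) / len(squared)


def tokenizer_loss(reconstruction_error: float,
                   perceptual_loss: float,
                   adversarial_loss: float,
                   perceptual_weight: float = 1.0,
                   adversarial_weight: float = 0.1) -> float:
    """Weighted sum of reconstruction, perceptual, and adversarial terms."""
    return (reconstruction_error
            + perceptual_weight * perceptual_loss
            + adversarial_weight * adversarial_loss)


if __name__ == "__main__":
    rec = mse([0.0, 1.0], [0.0, 0.0])
    # Perceptual (e.g. LPIPS) and adversarial values are stand-in numbers here.
    print(tokenizer_loss(rec, perceptual_loss=0.2, adversarial_loss=1.0))
```

In practice the perceptual and adversarial terms would come from separate networks, and the weights would be tuned; the sketch only shows how the three signals are combined into one scalar.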

    By adapting compression to content complexity, CAT improves both image reconstruction and generation. On reconstruction tasks it improves the rFID, LPIPS, and PSNR metrics, delivering a 12% quality improvement on CelebA and a 39% improvement on ChartQA, while keeping quality comparable on datasets such as COCO and ImageNet with fewer tokens. For class-conditional ImageNet generation, CAT outperforms the fixed-ratio baselines with an FID of 4.56 and improves inference throughput by 18.5%. These results make adaptive tokenization a reference point for future work.

    CAT takes a new approach to image tokenization: it adjusts the compression level to the complexity of the content. By combining LLM-based complexity assessment with an adaptive nested VAE, it addresses persistent inefficiencies of fixed-ratio tokenization and improves performance on reconstruction and generation tasks. Its adaptability and effectiveness make it a promising tool for AI-driven image modeling, with potential applications extending to video and multi-modal domains.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Content-Adaptive Tokenizer (CAT): An Image Tokenizer that Adapts Token Count based on Image Complexity, Offering Flexible 8x, 16x, or 32x Compression appeared first on MarkTechPost.
