Efficient training of vision models remains a major challenge in AI. Transformer-based models suffer from computational bottlenecks because of the quadratic complexity of self-attention, and Vision Transformers (ViTs), despite their strong results on difficult vision tasks, require extensive compute and memory, which makes them hard to deploy in real-time or resource-constrained settings. State space models (SSMs) have recently attracted interest because they scale well to long sequences, yet even state-of-the-art SSM-based models such as Vision Mamba (Vim) still incur high memory and computation costs during training. Overcoming these limitations efficiently would greatly expand the use of AI vision models in areas that demand a careful balance between accuracy and computational efficiency, such as autonomous systems and medical imaging.
Most existing work on improving the efficiency of vision models targets ViTs and relies on two classical techniques: token pruning, which removes tokens that carry little information, and token merging, which combines tokens so that the essential content is retained at reduced complexity. Although these methods improve the computational efficiency of Transformer models, they transfer poorly to SSMs, which must preserve long-range dependencies while processing data sequentially. Moreover, most traditional token fusion techniques apply uniform fusion across all layers, which causes noticeable accuracy loss and can degrade the performance of SSM-based vision models in critical applications. As a result, these methods fail to give models such as Vim a desirable balance between accuracy and efficiency.
Researchers from Ohio State University introduce Famba-V, a targeted, cross-layer token fusion strategy for Vision Mamba that overcomes these limitations by applying token fusion only in selected layers, chosen according to their contribution to efficiency and accuracy. Famba-V detects and merges similar tokens using cosine similarity and supports three fusion strategies: Interleaved, Lower-layer, and Upper-layer Token Fusion. Each strategy targets a different subset of the Vim model's layers in pursuit of the best trade-off between resource efficiency and performance. The Interleaved strategy, for example, applies fusion to every other layer and gains efficiency at a relatively small cost in accuracy, while the Upper-layer strategy avoids interfering with the early layers that process the raw data and delivers the best performance for Famba-V, along with the computational and memory savings needed for resource-limited applications.
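As a rough illustration of how these strategies differ, the sketch below maps each strategy to a set of Vim layer indices where fusion would be applied. The exact layer splits are assumptions inferred from the article's description (every other layer for Interleaved, the earlier and later halves for Lower- and Upper-layer fusion), not the paper's precise configuration.

```python
# Minimal sketch of Famba-V's three cross-layer fusion strategies, assuming a
# Vim backbone with num_layers blocks; the exact layer splits are illustrative.

def fusion_layers(strategy: str, num_layers: int) -> list[int]:
    """Return the indices of Vim layers where token fusion is applied."""
    if strategy == "interleaved":
        # Interleaved: apply fusion to every other layer.
        return list(range(0, num_layers, 2))
    if strategy == "lower":
        # Lower-layer: apply fusion only in the earlier half of the network.
        return list(range(0, num_layers // 2))
    if strategy == "upper":
        # Upper-layer: apply fusion only in the later half, leaving the early
        # layers that process the raw token sequence untouched.
        return list(range(num_layers // 2, num_layers))
    raise ValueError(f"unknown strategy: {strategy}")


# Example: a 24-block Vim model under the Upper-layer strategy.
print(fusion_layers("upper", 24))  # [12, 13, ..., 23]
```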
The team benchmarked Famba-V on CIFAR-100 using the Vim-Ti and Vim-S variants of the Vim architecture, which have 7 million and 26 million parameters respectively. In each layer selected for fusion, Famba-V measures the similarity between tokens and averages the most similar token pairs. Each strategy was configured to reduce roughly the same number of tokens so that comparisons remain fair. Experiments were run on an NVIDIA A100 GPU with a typical training configuration, including standard augmentations such as random cropping, label smoothing, and mixup, which contributed to the robustness of the results. By offering multiple fusion strategies, Famba-V lets users choose the accuracy-efficiency trade-off that matches their computational budget.
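A minimal sketch of that per-layer fusion step is shown below. It pairs tokens by cosine similarity and averages the most similar pairs, in the spirit of bipartite token merging; the function name, the alternating source/destination split, and the per-batch merge loop are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def fuse_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """
    Merge the r most similar token pairs in x by averaging them.

    x: (batch, num_tokens, dim) token embeddings entering a fusion layer.
    r: number of token pairs to merge (the sequence shrinks by r tokens).
    """
    b, n, d = x.shape
    # Split tokens into two alternating sets and compare them with cosine
    # similarity, in the spirit of bipartite token merging (assumed here).
    src, dst = x[:, ::2, :], x[:, 1::2, :]
    sim = F.cosine_similarity(src.unsqueeze(2), dst.unsqueeze(1), dim=-1)  # (b, n_src, n_dst)

    # For each source token, find its best match in the destination set,
    # then keep only the r highest-similarity pairs for merging.
    best_sim, best_dst = sim.max(dim=-1)                           # (b, n_src)
    merge_idx = best_sim.argsort(dim=-1, descending=True)[:, :r]   # (b, r)

    fused = []
    for bi in range(b):
        dst_b = dst[bi].clone()
        keep_src = torch.ones(src.shape[1], dtype=torch.bool)
        for si in merge_idx[bi]:
            di = best_dst[bi, si]
            # Average the matched pair into the destination token and
            # drop the source token from the sequence.
            dst_b[di] = (dst_b[di] + src[bi, si]) / 2
            keep_src[si] = False
        fused.append(torch.cat([src[bi][keep_src], dst_b], dim=0))
    return torch.stack(fused)  # (b, n - r, d)


# Usage: merge 16 token pairs in a Vim-Ti-sized token sequence.
tokens = torch.randn(2, 197, 192)
print(fuse_tokens(tokens, r=16).shape)  # torch.Size([2, 181, 192])
```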
On CIFAR-100, Famba-V strikes an effective balance between efficiency and accuracy across the Vim variants: training time and memory usage drop substantially while accuracy stays close to the non-fused baselines. Upper-layer token fusion on Vim-S reached a Top-1 accuracy of 75.2%, a small drop from the non-fused baseline of 76.9%, while significantly reducing memory usage, showing the strategy's ability to preserve model quality at much higher efficiency. Likewise, Interleaved fusion on Vim-Ti achieved a Top-1 accuracy of 67.0% and cut training time to under four hours, again demonstrating that Famba-V can be tailored to the available resources. This ability to adapt to computational constraints while maintaining strong performance makes Famba-V practical for applications that require both efficiency and reliability in resource-constrained settings.
In conclusion, Famba-V's cross-layer token fusion framework marks a substantial advance in training efficiency for Vision Mamba models. By offering three cross-layer strategies, Interleaved, Lower-layer, and Upper-layer Fusion, Famba-V achieves a flexible balance between accuracy and efficiency, notably reducing training time and memory consumption on CIFAR-100 without sacrificing model robustness. This adaptability makes Famba-V particularly valuable for real-world vision tasks and extends the reach of SSM-based models into resource-constrained environments. Future work could explore how Famba-V can be combined with other efficiency techniques to further optimize SSM-based models for vision tasks.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.