
    Comprehensive Analysis of The Performance of Vision State Space Models (VSSMs), Vision Transformers, and Convolutional Neural Networks (CNNs)

    July 1, 2024

    Deep learning models like Convolutional Neural Networks (CNNs) and Vision Transformers have achieved great success in many visual tasks, such as image classification, object detection, and semantic segmentation. However, their ability to cope with changes in the data remains a major concern, especially for use in security-critical applications. Many works have evaluated the robustness of CNNs and Transformers against common corruptions, domain shifts, information drops, and adversarial attacks. These studies show that a model's design affects its ability to handle such issues, and that robustness varies across architectures. A major drawback of transformers is that their computational cost scales quadratically with input size, making them expensive for complex tasks.
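    For reference, the quadratic cost mentioned above comes from the self-attention operation (standard notation, not specific to this paper): for N tokens of dimension d, attention builds an N-by-N score matrix,

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V,   with Q, K, V in R^{N x d},

    so compute and memory grow as O(N^2 d) in the sequence length, whereas state space models process the same sequence with cost linear in N.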

    This paper discusses two related topics: the Robustness of Deep Learning Models (RDLM) and State Space Models (SSMs). RDLM focuses on how well a conventionally trained model can maintain good performance when faced with natural and adversarial changes in the data distribution. In real-world situations, deep learning models often face data corruptions, like noise, blur, and compression artifacts, as well as intentional perturbations designed to fool the model. These issues can significantly harm performance, so it is important to evaluate models under such conditions to ensure they are reliable and robust. SSMs, on the other hand, are a promising approach for modeling sequential data in deep learning: they transform a one-dimensional input sequence into an output sequence through an implicit latent state.
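    As a point of reference (standard SSM notation, not taken from this paper), a continuous-time state space model maps an input signal x(t) to an output y(t) through a latent state h(t):

    h'(t) = A h(t) + B x(t),    y(t) = C h(t) + D x(t).

    In practice the system is discretized into a linear recurrence, h_k = \bar{A} h_{k-1} + \bar{B} x_k and y_k = C h_k, which is how VSSM-style vision backbones can scan an image's patch sequence with cost linear in the number of patches.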

    Researchers from MBZUAI (UAE), Linköping University, and ANU (Australia) have introduced a comprehensive analysis of the performance of VSSMs, Vision Transformers, and CNNs. The analysis examines how these models handle a variety of challenges in classification, detection, and segmentation tasks, and provides valuable insights into their robustness and suitability for real-world applications. The evaluations are divided into three parts, each focusing on an important aspect of model robustness. The first part, Occlusions and Information Loss, evaluates the robustness of VSSMs against information loss along scanning directions and against occlusions. The other two parts cover Common Corruptions and Adversarial Attacks.

    In the second part, the robustness of VSSM-based classification models is tested against common corruptions that reflect real-world conditions. These include global corruptions, like noise, blur, weather, and digital distortions at different severity levels, as well as fine-grained corruptions such as object attribute editing and background changes. The evaluation is then extended to VSSM-based detection and segmentation models to assess their strength in dense prediction tasks. In the third and final part, the robustness of VSSMs is analyzed against adversarial attacks in both white-box and black-box settings. This analysis gives insight into the ability of VSSMs to resist adversarial perturbations at various frequency levels.
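    As an illustration only (the model, data loader, and corruption function below are placeholders, not the paper's actual benchmark suite), a corruption-robustness evaluation loop of this kind can be sketched in PyTorch as follows:

    import torch

    def add_gaussian_noise(images, severity):
        # One simple "common corruption": additive Gaussian noise whose strength
        # grows with the severity level (1-5), clipped back to the [0, 1] range.
        sigma = 0.04 * severity
        return torch.clamp(images + sigma * torch.randn_like(images), 0.0, 1.0)

    @torch.no_grad()
    def accuracy_under_corruption(model, loader, severity, device="cpu"):
        # Top-1 accuracy of `model` on `loader` after corrupting every batch.
        model.eval()
        correct, total = 0, 0
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(add_gaussian_noise(images, severity)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return correct / total

    # Usage idea: compare robustness across severities, e.g.
    # for severity in range(1, 6):
    #     print(severity, accuracy_under_corruption(model, val_loader, severity))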

    Based on the evaluations across all three parts, here are the key findings:

    In the first part, ConvNext and VSSM models are found to handle sequential information loss along the scanning direction better than ViT and Swin models. In scenarios involving patch drops, VSSMs show the highest overall robustness, although Swin models perform better under extreme information loss.
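    For intuition about what "information loss along the scanning direction" means, here is a minimal, hypothetical sketch (not the paper's protocol) that zeroes out the first patches of each image in raster-scan order:

    import torch

    def drop_patches_along_scan(images, patch_size=16, drop_ratio=0.25):
        # Split a batch of images (B, C, H, W) into a grid of patches, zero out
        # the first `drop_ratio` fraction of patches in raster-scan order, and
        # reassemble the images.
        b, c, h, w = images.shape
        ph, pw = h // patch_size, w // patch_size
        patches = images.reshape(b, c, ph, patch_size, pw, patch_size)
        patches = patches.permute(0, 1, 2, 4, 3, 5).contiguous()
        flat = patches.reshape(b, c, ph * pw, patch_size, patch_size)
        k = int(drop_ratio * ph * pw)
        flat[:, :, :k] = 0.0  # drop the first k patches along the scan order
        patches = flat.reshape(b, c, ph, pw, patch_size, patch_size)
        patches = patches.permute(0, 1, 2, 4, 3, 5).contiguous()
        return patches.reshape(b, c, h, w)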

    Under global corruptions, VSSM models experience the smallest average performance drop compared to Swin and ConvNext models. For fine-grained corruptions, VSSM models outperform all transformer-based variants and either match or exceed the ConvNext models.

    For adversarial attacks, smaller VSSM models show greater robustness to white-box attacks than their Swin Transformer counterparts. VSSM models maintain above 90% robustness against strong low-frequency perturbations, but their performance drops quickly under high-frequency attacks.
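    As a concrete, deliberately simple example of a white-box attack of the kind used in such evaluations, here is an FGSM-style sketch; the model and attack budget are placeholders, and the paper's own attack suite may differ:

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, images, labels, epsilon=4 / 255):
        # Single-step white-box attack: perturb each pixel by +/- epsilon in the
        # direction that increases the classification loss, then clip to [0, 1].
        images = images.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        adversarial = images + epsilon * images.grad.sign()
        return torch.clamp(adversarial, 0.0, 1.0).detach()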


    In conclusion, researchers thoroughly evaluated the robustness of Vision State-Space Models (VSSMs) under various natural and adversarial disturbances, showing their strengths and weaknesses compared to transformers and CNNs. The experiments revealed the capabilities and limitations of VSSMs in handling occlusions, common corruptions, and adversarial attacks, as well as their ability to adapt to changes in object-background composition in complex visual scenes. This study will guide future research to enhance the reliability and effectiveness of visual perception systems in real-world situations.

    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

    The post Comprehensive Analysis of The Performance of Vision State Space Models (VSSMs), Vision Transformers, and Convolutional Neural Networks (CNNs) appeared first on MarkTechPost.
