
    ViLa-MIL: Enhancing Whole Slide Image Classification with Dual-Scale Vision-Language Multiple Instance Learning

    February 19, 2025

Whole Slide Image (WSI) classification in digital pathology presents several critical challenges due to the immense size and hierarchical nature of WSIs. A single WSI contains billions of pixels, so processing it directly is computationally infeasible. Current strategies based on multiple instance learning (MIL) perform well but depend heavily on large amounts of bag-level annotated data, which is difficult to acquire, particularly for rare diseases. Moreover, they rely almost exclusively on image features and generalize poorly when data distributions differ across hospitals. Recent Vision-Language Models (VLMs) introduce linguistic priors through large-scale pretraining on image-text pairs, but current approaches lack the domain-specific knowledge pathology requires. The computational expense of pretraining and the poor fit of these models to the hierarchical structure of WSIs are further obstacles. Overcoming these challenges is essential for AI-based cancer diagnosis and reliable WSI classification.

MIL-based methods generally adopt a three-stage pipeline: cropping patches from the WSI, extracting features with a pre-trained encoder, and aggregating patch-level features into a slide-level representation for prediction. Although these methods are effective for pathology tasks such as cancer subtyping and staging, their dependence on large annotated datasets and their sensitivity to distribution shift limit their practicality. VLM-based models such as CLIP and BiomedCLIP tap into language priors by pretraining on large collections of image-text pairs gathered from online databases. These models, however, rely on generic, handcrafted text prompts that lack the nuance of pathological diagnosis. In addition, transferring knowledge from vision-language models to WSIs is inefficient because of the hierarchical, gigapixel scale of the slides, which demands enormous computational cost and dataset-specific fine-tuning.
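The three-stage MIL pipeline described above can be sketched in a few lines of numpy. This is a minimal toy illustration, not the paper's implementation: a random projection stands in for the frozen pre-trained encoder, the patches are tiny random arrays rather than real slide tiles, and the attention-pooling aggregator and linear classifier use untrained random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: crop patches from a WSI. Here we fake a bag of 32 small patches
# (a real pipeline would tile the gigapixel slide into e.g. 224x224 crops).
num_patches, feat_dim = 32, 512
patches = rng.normal(size=(num_patches, 3, 32, 32))

# Stage 2: feature extraction with a pre-trained encoder. A fixed random
# projection stands in for a frozen ResNet-50 backbone.
flat = patches.reshape(num_patches, -1)                  # (32, 3072)
W_enc = rng.normal(size=(flat.shape[1], feat_dim)) / np.sqrt(flat.shape[1])
features = flat @ W_enc                                  # (32, 512) patch features

# Stage 3: attention-based aggregation into one slide-level feature,
# followed by a linear classifier (2 classes, e.g. tumor subtypes).
attn_w = rng.normal(size=(feat_dim,))
scores = features @ attn_w
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                     # attention weights over patches
slide_feat = alpha @ features                            # (512,) slide-level feature

W_clf = rng.normal(size=(feat_dim, 2))
logits = slide_feat @ W_clf
pred = int(np.argmax(logits))
```

The bag-level label supervises only `logits`, which is what makes MIL weakly supervised: no individual patch is ever annotated.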

Researchers from Xi’an Jiaotong University, Tencent YouTu Lab, and the Institute of High-Performance Computing, Singapore introduce a dual-scale vision-language multiple instance learning model that efficiently transfers vision-language model knowledge to digital pathology through descriptive text prompts designed specifically for pathology and trainable decoders for the image and text branches. In contrast to the generic class-name-based prompts of traditional vision-language methods, the model uses a frozen large language model to generate domain-specific descriptions at two resolutions. The low-scale prompt highlights global tumor structure, while the high-scale prompt captures finer cellular detail, improving feature discrimination. A prototype-guided patch decoder progressively aggregates patch features by clustering similar patches into learnable prototype vectors, reducing computational complexity and improving feature representation. A context-guided text decoder further refines the text descriptions using multi-granular image context, enabling a more effective fusion of the visual and textual modalities.
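The prototype-guided aggregation can be pictured as a cross-attention step in which a small set of learnable prototype vectors attends over all patch features, so the slide is summarized by a handful of prototypes instead of hundreds of thousands of patches. The numpy sketch below is an assumed simplification: the prototypes are random placeholders for vectors that would be learned during training, and the exact decoder architecture in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
num_patches, feat_dim, num_prototypes = 200, 512, 16

patch_feats = rng.normal(size=(num_patches, feat_dim))       # encoder outputs
prototypes = rng.normal(size=(num_prototypes, feat_dim))     # learnable in training

# Cross-attention: each prototype computes scaled similarities to every patch,
# then softly pools the patches it is most similar to.
sim = prototypes @ patch_feats.T / np.sqrt(feat_dim)         # (16, 200)
attn = np.exp(sim - sim.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)                      # rows sum to 1
proto_feats = attn @ patch_feats                             # (16, 512)
```

Downstream computation now scales with the 16 prototypes rather than the patch count, which is the source of the computational savings.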

The proposed model builds on CLIP and adds several components to adapt it to pathology tasks. Whole slide images are segmented into patches at the 5× and 10× magnification levels, and features are extracted with a frozen ResNet-50 image encoder. A frozen GPT-3.5 large language model generates class-specific descriptive prompts for the two scales, combined with learnable prompt vectors for effective feature representation. Progressive feature aggregation is handled by a set of 16 learnable prototype vectors. The multi-granular patch and prototype features also condition the text embeddings, improving cross-modal alignment. Training uses a cross-entropy loss over equally weighted low- and high-scale similarity scores for robust classification.
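The final classification step, CLIP-style similarity between slide features and class text embeddings at each scale with the two scales equally weighted, can be sketched as follows. All embeddings here are random stand-ins for the real encoder outputs, and the temperature value is an assumption borrowed from standard CLIP practice, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
feat_dim, num_classes = 512, 3

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Slide-level features at the two magnifications (after prototype aggregation).
low_feat = l2norm(rng.normal(size=(feat_dim,)))    # 5x: global tumor structure
high_feat = l2norm(rng.normal(size=(feat_dim,)))   # 10x: finer cellular detail

# Class text embeddings from the LLM-generated descriptive prompts,
# one set per scale (encoded by the frozen CLIP text branch).
low_text = l2norm(rng.normal(size=(num_classes, feat_dim)))
high_text = l2norm(rng.normal(size=(num_classes, feat_dim)))

tau = 0.07  # CLIP-style temperature (assumed value)
low_logits = low_text @ low_feat / tau
high_logits = high_text @ high_feat / tau
logits = 0.5 * (low_logits + high_logits)          # equally weighted scales

def cross_entropy(logits, label):
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

loss = cross_entropy(logits, label=1)
```

At inference the predicted class is simply `np.argmax(logits)`, i.e. the class whose dual-scale text description best matches the slide.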

The method outperforms current MIL-based and VLM-based approaches on multiple cancer subtyping datasets in few-shot learning scenarios. It records consistent gains in AUC, F1 score, and accuracy across three diverse datasets (TIHD-RCC, TCGA-RCC, and TCGA-Lung), demonstrating robustness in both single-center and multi-center evaluations. Compared with state-of-the-art approaches, it improves AUC by 1.7% to 7.2% and F1 score by 2.1% to 7.3%. The dual-scale text prompts, prototype-guided patch decoder, and context-guided text decoder help the framework learn discriminative morphological patterns from only a few training instances. Strong generalization across datasets also indicates better adaptability to domain shift in cross-center testing. These results demonstrate the merit of combining vision-language models with pathology-specific designs for whole slide image classification.

By developing a new dual-scale vision-language learning framework, this research makes a substantial contribution to WSI classification, using large language models for text prompting and prototype-based feature aggregation. The method improves few-shot generalization, reduces computational cost, and promotes interpretability, addressing core challenges in pathology AI. By successfully transferring vision-language models to digital pathology, this work is a valuable contribution to AI-based cancer diagnosis, with the potential to generalize to other medical imaging tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


    The post ViLa-MIL: Enhancing Whole Slide Image Classification with Dual-Scale Vision-Language Multiple Instance Learning appeared first on MarkTechPost.

