
    Researchers at Intel Labs Introduce LLaVA-Gemma: A Compact Vision-Language Model Leveraging the Gemma Large Language Model in Two Variants (Gemma-2B and Gemma-7B)

    April 7, 2024

Recent advances in large language models (LLMs) and multimodal foundation models (MMFMs) have spurred interest in large multimodal models (LMMs). Models such as GPT-4, LLaVA, and their derivatives have shown remarkable performance on vision-language tasks such as visual question answering and image captioning. However, their high computational demands have prompted exploration of smaller-scale LMMs.

Researchers from the Cognitive AI group at Intel Labs introduce LLaVA-Gemma, a suite of vision-language assistants trained from the Gemma LLM variants Gemma-2B and Gemma-7B and inspired by progress in small yet capable vision-language models (VLMs) such as LLaVA-Phi. By offering two variants with different parameter counts, LLaVA-Gemma lets researchers investigate the trade-off between computational efficiency and the richness of visual and linguistic understanding. The researchers also examine how Gemma's massively increased token set affects multimodal performance.

LLaVA-Gemma follows the LLaVA framework with modifications, combining a pretrained vision encoder (such as CLIP) and a pretrained language model (here, Gemma) through an MLP connector. Training proceeds in two stages: the MLP connector is first pretrained on a custom dataset, and then the language model and connector are jointly finetuned on multimodal instruction-tuning examples. Deviations from the original recipe include using Gemma models as the language backbone, employing the larger DINOv2 image encoder for vision, and exploring whether skipping the initial pretraining stage improves performance; models are evaluated both with and without that initial connector-pretraining stage.
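
    The following is a minimal PyTorch sketch (not the authors' code) of the LLaVA-style wiring described above: a pretrained vision encoder feeds an MLP connector that projects patch features into the language model's embedding space, and the two training stages differ only in which components are trainable. Class names, dimensions, and the `set_trainable` helper are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a LLaVA-style vision-language model, assuming a vision
# encoder that returns patch features of shape [batch, num_patches, vision_dim]
# and a decoder-only LM that accepts precomputed input embeddings.
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Two-layer MLP projecting vision features into the LM hidden size."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)


class LlavaStyleVLM(nn.Module):
    """Vision encoder (e.g. CLIP or DINOv2) + MLP connector + language model."""

    def __init__(self, vision_encoder: nn.Module, connector: MLPConnector,
                 language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.language_model = language_model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode image patches, project them into the LM embedding space,
        # and prepend the resulting "visual tokens" to the text embeddings.
        patch_features = self.vision_encoder(pixel_values)
        visual_tokens = self.connector(patch_features)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)


def set_trainable(model: LlavaStyleVLM, stage: int) -> None:
    # Stage 1: pretrain only the connector (vision encoder and LM frozen).
    # Stage 2: jointly finetune the connector and the language model.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.connector.parameters():
        p.requires_grad = True
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)
```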

For the 2B backbone, the DINOv2 variants outperform the CLIP variants on all benchmarks except POPE-F1 and MMVP. Comparing training speed for the two model sizes, the Gemma-2B model trained in 4 hours on 8 Intel Gaudi 2 AI accelerators, while the larger Gemma-7B model required 16 hours under the same conditions. The Gemma-7B model, with its larger parameter count, therefore takes approximately four times longer to train, giving it a relative training speed of about 0.25x compared to Gemma-2B. These results highlight the trade-off between model size and training efficiency: larger models demand significantly more computational resources and time.

    Contributions to this research are as follows:

    1. Researchers introduce LLaVA-Gemma, an MMFM leveraging compact, powerful Gemma language models for efficient multimodal interactions. 

2. They extensively evaluate the Gemma-2B and Gemma-7B model variants, providing valuable insights into the trade-offs between computational efficiency and the richness of visual and linguistic understanding in LLMs.

3. They present a detailed exploration of alternative design choices and visualize attention with relevancy maps to better understand the model’s behavior.

In conclusion, the research introduces LLaVA-Gemma, a compact vision-language model built on the Gemma LLM in two variants, Gemma-2B and Gemma-7B. It gives researchers a unique opportunity to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. Evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research on small-scale vision-language models.

Check out the Paper and HF Page. All credit for this research goes to the researchers of this project.

    The post Researchers at Intel Labs Introduce LLaVA-Gemma: A Compact Vision-Language Model Leveraging the Gemma Large Language Model in Two Variants (Gemma-2B and Gemma-7B) appeared first on MarkTechPost.
