
Google DeepMind Unveils PaliGemma: A Versatile 3B Vision-Language Model (VLM) with Large-Scale Ambitions

    July 12, 2024

    Vision-language models have evolved significantly over the past few years, with two distinct generations emerging. The first generation, exemplified by CLIP and ALIGN, expanded on large-scale classification pretraining by utilizing web-scale data without requiring extensive human labeling. These models used caption embeddings obtained from language encoders to broaden the vocabulary for classification and retrieval tasks. The second generation, akin to T5 in language modeling, unified captioning and question-answering tasks through generative encoder-decoder modeling. Models like Flamingo, BLIP-2, and PaLI further scaled up these approaches. Recent developments have introduced an additional “instruction tuning” step to enhance user-friendliness. Alongside these advancements, systematic studies have aimed to identify the critical factors in vision-language models. 

Building on this progress, DeepMind researchers present PaliGemma, an open vision-language model combining the strengths of the PaLI vision-language model series with the Gemma family of language models. This approach builds upon the success of previous PaLI iterations, which demonstrated impressive scaling capabilities and performance improvements. PaliGemma integrates a 400M SigLIP vision model with a 2B Gemma language model, resulting in a sub-3B vision-language model that rivals the performance of much larger predecessors like PaLI-X, PaLM-E, and PaLI-3. The Gemma component, derived from the same technology powering the Gemini models, contributes its auto-regressive decoder-only architecture to enhance PaliGemma's capabilities. This fusion of vision and language processing techniques positions PaliGemma as a significant advancement in multimodal AI.

PaliGemma's architecture comprises three key components: a SigLIP So400m ViT image encoder, a Gemma-2B v1.0 decoder-only language model, and a linear projection layer. The image encoder transforms input images into a sequence of tokens, while the language model processes text using its SentencePiece tokenizer. The linear projection layer aligns the dimensions of image tokens with those of text tokens, allowing the two sequences to be concatenated. This simple yet effective design enables PaliGemma to handle various tasks, including image classification, captioning, and visual question-answering, through a flexible image+text in, text out API.
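The three-component design above can be sketched in a few lines. This is a minimal NumPy illustration of the projection-and-concatenation step, not the actual implementation; the dimensions and the `project_and_concat` helper are illustrative placeholders, not the published PaliGemma values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: vision-encoder output width vs. language-model
# embedding width, and token counts for one image and a short text prompt.
D_IMG, D_TXT = 1152, 2048
N_IMG, N_TXT = 256, 16

def project_and_concat(image_tokens, text_embeddings, W, b):
    """Map image tokens into the language model's embedding space and
    concatenate them with the text embeddings, mirroring the role of
    PaliGemma's linear projection layer."""
    projected = image_tokens @ W + b               # (N_IMG, D_TXT)
    return np.concatenate([projected, text_embeddings], axis=0)

image_tokens = rng.standard_normal((N_IMG, D_IMG))    # from the image encoder
text_embeddings = rng.standard_normal((N_TXT, D_TXT)) # from the text embedder
W = rng.standard_normal((D_IMG, D_TXT)) * 0.02
b = np.zeros(D_TXT)

sequence = project_and_concat(image_tokens, text_embeddings, W, b)
print(sequence.shape)  # (272, 2048): image tokens first, then text tokens
```

The single fused sequence is what the decoder-only language model then attends over.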

    The model’s input sequence structure is carefully designed for optimal performance. Image tokens are placed at the beginning, followed by a BOS token, prefix tokens (task description), a SEP token, suffix tokens (prediction), an EOS token, and PAD tokens. This arrangement allows for full attention across the entire input, enabling image tokens to consider the task context when updating their representations. The suffix, which forms the output, is covered by an auto-regressive mask to maintain the generation process’s integrity.
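The attention pattern described above is a prefix-LM mask: full bidirectional attention over the image-plus-prefix region, causal attention over the suffix. A small sketch of how such a mask could be constructed (the function name is our own, and token-type bookkeeping such as BOS/SEP/PAD positions is elided):

```python
import numpy as np

def prefix_lm_mask(n_prefix, n_suffix):
    """Build a boolean attention mask where mask[i, j] means position i may
    attend to position j. Prefix positions (image tokens + BOS + task prefix
    + SEP) attend to each other bidirectionally; suffix positions (the
    prediction) see the full prefix plus earlier suffix tokens only."""
    n = n_prefix + n_suffix
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_prefix] = True                  # everyone sees the whole prefix
    for i in range(n_prefix, n):
        mask[i, n_prefix:i + 1] = True         # suffix is auto-regressive
    return mask
```

With `prefix_lm_mask(3, 2)`, for example, prefix rows never attend to suffix columns, while the second suffix token attends to the prefix and the first suffix token.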

    PaliGemma’s training process involves multiple stages to ensure comprehensive visual-language understanding. It begins with unimodal pretraining of individual components, followed by multimodal pretraining on a diverse mixture of tasks. Notably, the image encoder is not frozen during this stage, allowing for improved spatial and relational understanding. The training continues with a resolution increase stage, enhancing the model’s ability to handle high-resolution images and complex tasks. Finally, a transfer stage adapts the base model to specific tasks or use cases, demonstrating PaliGemma’s versatility and effectiveness across various applications.
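The staged recipe can be summarized as a simple schedule. The stage names below follow the description in the text; the resolutions match the ones reported, but any step counts, learning rates, or freezing details beyond what the text states are deliberately omitted or marked as placeholders.

```python
# Illustrative summary of PaliGemma's training stages (not a training script).
TRAINING_STAGES = [
    {"name": "stage0_unimodal",   "resolution": None,
     "note": "SigLIP and Gemma components pretrained separately"},
    {"name": "stage1_multimodal", "resolution": 224,
     "note": "joint pretraining on a diverse task mixture; "
             "image encoder is NOT frozen"},
    {"name": "stage2_resolution", "resolution": 448,
     "note": "continued pretraining at higher resolution (also 896px)"},
    {"name": "stage3_transfer",   "resolution": 224,
     "note": "adapt the base model to a specific task or use case"},
]

def resolution_for(stage_name):
    """Look up the input resolution used in a given stage."""
    for stage in TRAINING_STAGES:
        if stage["name"] == stage_name:
            return stage["resolution"]
    raise KeyError(stage_name)
```

Keeping the encoder trainable in stage 1 is the notable departure from the common frozen-encoder recipe, which the text credits for improved spatial and relational understanding.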

    The results demonstrate PaliGemma’s impressive performance across a wide range of visual-language tasks. The model excels in image captioning, achieving high scores on benchmarks like COCO-Captions and TextCaps. In visual question answering, PaliGemma shows strong performance on various datasets, including VQAv2, GQA, and ScienceQA. The model also performs well on more specialized tasks such as chart understanding (ChartQA) and OCR-related tasks (TextVQA, DocVQA). Notably, PaliGemma exhibits significant improvements when increasing image resolution from 224px to 448px and 896px, especially for tasks involving fine-grained details or text recognition. The model’s versatility is further demonstrated by its ability to handle video input tasks and image segmentation challenges.

The researchers also report several noteworthy findings:

    Simple square resizing (224×224) performs as well as complex aspect-ratio preserving techniques for segmentation tasks.

    Researchers introduced CountBenchQA, a new dataset addressing limitations in TallyQA for assessing VLMs’ counting abilities.

    Discrepancies were found in previously published WidgetCaps numbers, invalidating some comparisons.

    Image annotations (e.g., red boxes) are as effective as text prompts for indicating widgets to be captioned.

    RoPE interpolation for image tokens during resolution upscaling (Stage 2) showed no significant benefits.

    PaliGemma demonstrates unexpected zero-shot generalization to 3D renders from Objaverse without specific training.

The model achieves state-of-the-art performance on MMVP, significantly outperforming larger models like GPT-4V and Gemini.

This research introduces PaliGemma, a robust, compact open base VLM that excels in transfer learning across diverse tasks. The work demonstrates that smaller VLMs can achieve state-of-the-art performance on a wide spectrum of benchmarks, challenging the notion that larger models are always superior. By releasing the base model without instruction tuning, the researchers aim to provide a valuable foundation for further studies in instruction tuning and specific applications. This approach encourages a clearer distinction between base models and fine-tuned versions in VLM research, potentially opening new avenues for more efficient and versatile AI systems in the field of visual-language understanding.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post Google DeepMind Unveils PaliGemma: A Versatile 3B Vision-Language Model (VLM) with Large-Scale Ambitions appeared first on MarkTechPost.
