    Google AI Introduces PaliGemma: A New Family of Vision Language Models 

    May 16, 2024

    Google has released PaliGemma, a new family of vision language models that take an image and a text prompt as input and produce text as output. The architecture of the PaliGemma (GitHub) family combines the SigLIP-So400m image encoder with the Gemma-2B text decoder. SigLIP is a state-of-the-art model that understands both images and text; like CLIP, it consists of an image encoder and a text encoder trained jointly. Gemma is a decoder-only model for text generation. PaliGemma connects SigLIP’s image encoder to Gemma through a linear adapter, which yields a capable vision language model. Like PaLI-3, the combined model is pre-trained on image-text data and can then be easily fine-tuned on downstream tasks such as captioning or referring segmentation.
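
To make the role of the linear adapter concrete, here is a minimal sketch in plain Python. It is not PaliGemma's implementation: the widths, the random weights, and the helper names `make_linear_adapter` and `project` are all hypothetical, and the ordering of image tokens before text tokens is an assumption. It only shows the general idea of projecting image-encoder outputs into the text decoder's embedding width so both can be fed to the decoder as one sequence.

```python
import random


def make_linear_adapter(in_dim, out_dim, seed=0):
    """Create toy weights for a linear layer mapping in_dim -> out_dim."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(tokens, weight, bias):
    """Apply y = W x + b to every token vector."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


# Toy sizes (hypothetical): 4 image tokens of width 6, text embeddings of width 8.
VISION_WIDTH, TEXT_WIDTH = 6, 8
image_tokens = [[1.0] * VISION_WIDTH for _ in range(4)]   # stand-in for image-encoder output
text_embeddings = [[0.5] * TEXT_WIDTH for _ in range(3)]  # stand-in for embedded prompt tokens

weight, bias = make_linear_adapter(VISION_WIDTH, TEXT_WIDTH)
projected = project(image_tokens, weight, bias)

# Assumed ordering: projected image tokens first, then the text tokens.
decoder_input = projected + text_embeddings
print(len(decoder_input), len(decoder_input[0]))  # prints "7 8"
```

After projection every token has the decoder's width, which is why a single linear layer is enough to join the two models in this sketch.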

    Big_vision was used as the training codebase for PaliGemma. Using the same codebase, numerous other models, including CapPa, SigLIP, LiT, BiT, and the original ViT, have already been developed. 

    The PaliGemma release includes three distinct model types, each offering a unique set of capabilities:

    PT checkpoints: Pretrained models that are highly adaptable and designed to be fine-tuned for a variety of downstream tasks.

    Mix checkpoints: PT models fine-tuned on a mixture of tasks. They are suitable for general-purpose inference with free-text prompts and may be used for research purposes only.

    FT checkpoints: A collection of fine-tuned models, each specialized on a different academic benchmark. They come in various resolutions and are intended for research purposes only.

    The models come in three resolutions (224×224, 448×448, and 896×896) and three precisions (bfloat16, float16, and float32). Each repository holds the checkpoints for a given task and resolution, with three revisions, one per precision. The main branch of each repository contains the float32 checkpoints, while the bfloat16 and float16 revisions contain the corresponding precisions. Note that models compatible with the original JAX implementation and models compatible with Hugging Face Transformers live in separate repositories.
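
The repository and revision layout described above can be sketched as a small lookup helper. The precision-to-revision mapping follows the text (float32 on the main branch, same-named revisions for the other two). The repository naming pattern, the `google` organization, the `paligemma-3b` base name, and the helper `checkpoint_ref` are assumptions made for illustration, so check the actual repository names before relying on them.

```python
from dataclasses import dataclass

RESOLUTIONS = (224, 448, 896)

# Per the text: float32 is on the main branch; the other precisions
# are stored in revisions named after the precision.
REVISION_FOR_PRECISION = {
    "float32": "main",
    "bfloat16": "bfloat16",
    "float16": "float16",
}


@dataclass(frozen=True)
class CheckpointRef:
    repo_id: str
    revision: str


def checkpoint_ref(variant, resolution, precision, org="google", base="paligemma-3b"):
    """Build a (repo_id, revision) pair. The naming pattern is an assumption."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if precision not in REVISION_FOR_PRECISION:
        raise ValueError(f"unsupported precision: {precision}")
    repo_id = f"{org}/{base}-{variant}-{resolution}"
    return CheckpointRef(repo_id, REVISION_FOR_PRECISION[precision])


ref = checkpoint_ref("mix", 224, "bfloat16")
print(ref.repo_id, ref.revision)
```

A pair like this could then be handed to a loader that accepts a repository id and a `revision` argument, such as `from_pretrained` in Hugging Face Transformers.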

    The high-resolution models can improve quality, but they require significantly more memory because their input sequences are much longer, which matters for users with limited resources. For most tasks the quality gain is small, so the 224 versions are a suitable choice for the majority of uses.
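
A rough calculation shows why input sequences grow so quickly with resolution. The sketch below assumes the image encoder splits the image into square patches of 14×14 pixels and emits one token per patch; the patch size is not stated in the article and is an assumption here, as is the helper name `image_token_count`.

```python
PATCH_SIZE = 14  # assumption: patch size of the image encoder, not stated in the article


def image_token_count(resolution, patch_size=PATCH_SIZE):
    """Number of image tokens for a square image, one token per patch."""
    if resolution % patch_size:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


for res in (224, 448, 896):
    print(res, image_token_count(res))
# 224 256
# 448 1024
# 896 4096
```

Under this assumption, doubling the resolution quadruples the number of image tokens, and memory use grows with it.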

    PaliGemma is a single-turn vision language model that performs best when fine-tuned to a particular use case. It is not intended for conversational use, so while it excels at specific tasks, it may not be the best choice for every application.

    The pretrained models were trained to acquire a wide range of skills, such as question answering, captioning, and segmentation, so users specify the task the model should perform by conditioning it with a task prefix such as ‘detect’ or ‘segment’. However, the pretrained models are not meant to be used directly; they are designed to be fine-tuned on specific tasks using a similar prompt structure. For interactive testing, the ‘mix’ family of models, which has been fine-tuned on a mixture of tasks, can be used.
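
As a small illustration of task prefixes, the helper below builds a prompt string from a task name and a target. Only the two prefixes named in the text are included; the function name `build_prompt` and the exact "prefix, space, target" layout are assumptions for this sketch.

```python
TASK_PREFIXES = ("detect", "segment")  # the two prefixes named in the article


def build_prompt(task, target):
    """Return a prompt of the form '<task> <target>' (layout is an assumption)."""
    task = task.strip().lower()
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    target = " ".join(target.split())  # collapse stray whitespace
    if not target:
        raise ValueError("target must not be empty")
    return f"{task} {target}"


print(build_prompt("detect", "cat"))       # detect cat
print(build_prompt("segment", "red car"))  # segment red car
```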

    Here are some examples of what PaliGemma can do: it can add captions to pictures, respond to questions about images, detect entities in pictures, segment entities within images, and reason and understand documents. These are just a few of its many capabilities.

    When prompted, PaliGemma can caption images. With the mix checkpoints, users can experiment with different captioning prompts to see how the model responds.

    PaliGemma can answer a question about an image that is passed along with it.

    PaliGemma can find entities in an image when given the detect [entity] prompt. It outputs the bounding box coordinates as special location tokens, each carrying an integer value that denotes a normalized coordinate.
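
The conversion from location tokens to pixel coordinates can be sketched as follows. The article does not spell out the token format, so everything specific here is an assumption: tokens written as `<locNNNN>` with four digits, four tokens per box in the order y_min, x_min, y_max, x_max, values binned into 1024 steps, and the label following the tokens. The sample string and the helper `parse_detections` are made up for illustration.

```python
import re

# Assumed format: four <locNNNN> tokens followed by a label.
DETECTION = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]+)")
LOC = re.compile(r"<loc(\d{4})>")


def parse_detections(text, image_width, image_height, bins=1024):
    """Turn location tokens into pixel boxes (x_min, y_min, x_max, y_max)."""
    boxes = []
    for locs, label in DETECTION.findall(text):
        # Assumed order: y_min, x_min, y_max, x_max, each normalized by `bins`.
        y0, x0, y1, x1 = (int(v) / bins for v in LOC.findall(locs))
        boxes.append({
            "label": label.strip(),
            "box": (x0 * image_width, y0 * image_height,
                    x1 * image_width, y1 * image_height),
        })
    return boxes


sample = "<loc0256><loc0128><loc0768><loc0896> cat"  # hypothetical model output
print(parse_detections(sample, image_width=640, image_height=480))
# [{'label': 'cat', 'box': (80.0, 120.0, 560.0, 360.0)}]
```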

    When prompted with segment [entity], PaliGemma mix checkpoints can also segment entities within an image. This technique is known as referring expression segmentation because the object of interest is referred to with a natural language description. The output is a sequence of location and segmentation tokens. As described above, the location tokens represent a bounding box. The segmentation tokens can be processed further to produce segmentation masks.
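
A first step in handling segmentation output is separating the two kinds of tokens. The sketch below assumes location tokens look like `<locNNNN>` and segmentation tokens like `<segNNN>`; both formats, the shortened sample string, and the helper `split_segmentation_output` are assumptions. Turning the segmentation codes into an actual mask requires a separate decoding step that is not shown here.

```python
import re

LOC = re.compile(r"<loc(\d{4})>")  # assumed location-token format
SEG = re.compile(r"<seg(\d{3})>")  # assumed segmentation-token format


def split_segmentation_output(text):
    """Separate bounding-box bins from segmentation codes in a model output."""
    loc_values = [int(v) for v in LOC.findall(text)]
    seg_values = [int(v) for v in SEG.findall(text)]
    if len(loc_values) != 4:
        raise ValueError("expected exactly four location tokens")
    return {"box_bins": tuple(loc_values), "mask_codes": seg_values}


# Hypothetical, shortened output; a real output would carry more <seg...> tokens.
sample = "<loc0010><loc0020><loc0030><loc0040><seg001><seg127> cat"
print(split_segmentation_output(sample))
# {'box_bins': (10, 20, 30, 40), 'mask_codes': [1, 127]}
```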

    PaliGemma mix checkpoints are very good at reasoning and understanding documents.

    The post Google AI Introduces PaliGemma: A New Family of Vision Language Models  appeared first on MarkTechPost.
