Llama 3 has outperformed GPT-3.5 and even surpassed GPT-4 on several benchmarks, demonstrating strong efficiency and task-specific performance despite having fewer parameters. GPT-4o, however, arrived with advanced multimodal capabilities and reclaimed the top position. Llama 3, which uses innovations such as Grouped-Query Attention, excels at translation and dialogue generation, while GPT-4 shows stronger reasoning and problem-solving; GPT-4o extends these abilities further with an improved architecture and greater multimodal proficiency.
This study presents Llama3-V, a multimodal model based on Llama3 and trained for under $500. It integrates visual information by encoding input images into patch embeddings with the SigLIP model. These embeddings are aligned with the textual tokens through a projection block built from self-attention layers, which maps the visual and textual embeddings into a shared representation space. The visual tokens are then prepended to the textual tokens, and the joint sequence is processed by Llama3, improving its ability to understand and integrate visual data.
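The overall flow can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the checkpoints, the simple linear projector, and the function name are assumptions standing in for the actual Llama3-V pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SiglipImageProcessor, SiglipVisionModel

# Illustrative checkpoints; the exact models used by Llama3-V may differ.
vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Placeholder projector; a fuller projection block with self-attention
# layers is sketched after the next paragraph.
projector = torch.nn.Linear(vision.config.hidden_size, llama.config.hidden_size)

def multimodal_forward(image, prompt):
    # 1. Encode the image into patch embeddings with SigLIP.
    pixels = processor(images=image, return_tensors="pt").pixel_values
    patch_embeds = vision(pixel_values=pixels).last_hidden_state   # (1, num_patches, d_vis)

    # 2. Project the visual embeddings into Llama3's embedding space.
    visual_tokens = projector(patch_embeds)                        # (1, num_patches, d_txt)

    # 3. Embed the text prompt and prepend the visual tokens.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llama.get_input_embeddings()(input_ids)          # (1, seq_len, d_txt)
    joint = torch.cat([visual_tokens, text_embeds], dim=1)

    # 4. Run the joint sequence through Llama3.
    return llama(inputs_embeds=joint)
```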
SigLIP, an image embedding model, uses a pairwise sigmoid loss that scores each image-text pair independently, unlike CLIP's contrastive loss with softmax normalization. SigLIP's vision encoder divides images into non-overlapping patches, projects them into a lower-dimensional embedding space, and applies self-attention to extract higher-level features. To align SigLIP's image embeddings with Llama3's textual embeddings, a projection module with two self-attention blocks is used. The resulting visual tokens are prepended to the textual tokens, creating a joint input for Llama3.
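A minimal sketch of such a projection module is shown below. The dimensions (1152 for the SigLIP so400m encoder, 4096 for Llama3-8B), the use of standard Transformer encoder layers as the two self-attention blocks, and the final linear map are assumptions, not the exact Llama3-V implementation.

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """Maps SigLIP patch embeddings into Llama3's embedding space.

    Sketch only: two self-attention (Transformer encoder) layers followed by a
    linear projection; layer details are assumptions, not the authors' code.
    """

    def __init__(self, vis_dim: int = 1152, txt_dim: int = 4096, n_heads: int = 8):
        super().__init__()
        self.attn_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=vis_dim, nhead=n_heads, batch_first=True)
             for _ in range(2)]
        )
        self.proj = nn.Linear(vis_dim, txt_dim)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vis_dim) from the SigLIP vision encoder
        x = patch_embeds
        for block in self.attn_blocks:
            x = block(x)            # self-attention over the patch sequence
        return self.proj(x)         # (batch, num_patches, txt_dim)
```

In use, `visual_tokens = ProjectionBlock()(patch_embeds)` produces the visual tokens that are concatenated in front of the text token embeddings.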
To optimize computational resources, two main strategies were employed. First, a caching mechanism precomputes the SigLIP image embeddings, which raises GPU utilization and allows larger batch sizes without out-of-memory errors; separating the SigLIP and Llama3 processing stages in this way improves efficiency. Second, MPS/MLX optimizations are used: because SigLIP is comparatively small, it can run inference on MacBooks at a throughput of 32 images per second. Together, these optimizations save training and inference time by managing resources efficiently and maximizing GPU usage.
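The caching idea can be sketched as follows, reusing the hypothetical `vision` and `processor` objects from the earlier snippet. The on-disk layout and the assumed `dataset` format (slices yielding `(image_id, PIL image)` pairs) are illustrative, not the authors' pipeline.

```python
import os
import torch

@torch.no_grad()
def cache_siglip_embeddings(dataset, vision, processor, cache_dir="siglip_cache", batch_size=64):
    """Precompute and store SigLIP patch embeddings for every image.

    Sketch: embeddings are written to disk once, so later Llama3 training never
    runs the vision encoder and can use larger batches without OOM errors.
    """
    os.makedirs(cache_dir, exist_ok=True)
    vision.eval()
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]               # assumed: list of (image_id, image)
        ids, images = zip(*batch)
        pixels = processor(images=list(images), return_tensors="pt").pixel_values
        embeds = vision(pixel_values=pixels).last_hidden_state  # (B, num_patches, d_vis)
        for image_id, emb in zip(ids, embeds):
            torch.save(emb.cpu(), os.path.join(cache_dir, f"{image_id}.pt"))

# During Llama3 training, the cached embeddings are simply loaded:
# patch_embeds = torch.load(os.path.join("siglip_cache", f"{image_id}.pt"))
```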
Precomputing image embeddings with SigLIP involves loading the SigLIP model, preprocessing the images, and obtaining their vector representations. High-resolution images are split into patches so they can be encoded efficiently. A sigmoid activation is applied to the logits to extract embeddings, which are then projected into a joint multimodal space using a learned weight matrix. These projected embeddings, or “latents,” are prepended to the text tokens for pretraining Llama3. Pretraining uses 600,000 image-text pairs and updates only the projection matrix; supervised finetuning then improves performance on 1M examples, focusing on the vision and projection matrices.
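A sketch of the pretraining stage described above appears below, using the hypothetical `vision`, `llama`, and `projector` objects from the earlier snippets: the backbone and vision encoder are frozen and only the projector receives gradients (the learning rate and masking details are assumptions). During the supervised finetuning stage, the vision-side parameters would additionally be unfrozen.

```python
import torch

# Freeze the vision encoder and the Llama3 backbone; train only the projector.
for module in (vision, llama):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def pretrain_step(patch_embeds, input_ids, labels):
    visual_tokens = projector(patch_embeds)                    # (B, P, d_txt)
    text_embeds = llama.get_input_embeddings()(input_ids)      # (B, T, d_txt)
    joint = torch.cat([visual_tokens, text_embeds], dim=1)

    # Visual positions carry no next-token target, so they are masked with -100.
    vis_labels = torch.full(visual_tokens.shape[:2], -100, dtype=labels.dtype)
    out = llama(inputs_embeds=joint, labels=torch.cat([vis_labels, labels], dim=1))

    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```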
Llama3-V achieves a 10–20% performance boost over Llava, the leading model for multimodal understanding. It also performs comparably to much larger closed-source models on most metrics, with MMMU as the exception, demonstrating its efficiency and competitiveness despite its smaller size.
To recapitulate, Llama3-V represents a notable advance in multimodal AI, outperforming Llava and rivaling much larger closed-source models on most metrics. By integrating SigLIP for efficient image embedding and applying targeted computational optimizations, it maximizes GPU utilization and keeps training costs low. Pretraining followed by supervised finetuning strengthens its multimodal capabilities, yielding the 10–20% performance gain over Llava. Its cost-effective training and design establish Llama3-V as a competitive, efficient state-of-the-art model for multimodal understanding.