Large Vision-Language Models (LVLMs) have made significant strides in recent years, yet several key limitations persist. One major challenge is aligning these models with human expectations, particularly for tasks that demand detailed and precise visual information. LVLMs traditionally follow a two-stage training paradigm: pretraining followed by supervised fine-tuning. Supervised fine-tuning alone, however, cannot close the alignment gap, and the preference-based methods that typically follow it depend on large-scale, human-annotated preference datasets that are scarce and costly to produce. Moreover, conventional reinforcement learning approaches require training dedicated reward models, which are expensive and may not fully capture the nuanced, subjective nature of human feedback.
A team of researchers from China proposes Vision-R1, a vision-guided, R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. Vision-R1 leverages curated instruction data, eliminating the dependency on specialized reward models and handcrafted preference datasets. Central to the method is a criterion-driven reward function that comprehensively evaluates model completions against task-specific visual criteria. In addition, a progressive rule refinement strategy dynamically adjusts the reward criteria throughout training, ensuring continuous performance improvement, mitigating reward hacking, and promoting more accurate object localization.
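As a rough illustration of how definitive vision feedback can stand in for a learned reward model, the sketch below turns rule-based rewards, computed directly against ground-truth annotations, into group-relative advantages in the spirit of R1-style training. The GRPO-style normalization and the helper names (criterion_reward, parse_boxes) are assumptions for illustration, not the authors' exact implementation; criterion_reward itself is sketched after the next paragraph.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rule-based rewards within one sampled group.

    No learned reward model is involved; each reward is computed directly by
    comparing a completion against the image's ground-truth annotations.
    """
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Hypothetical usage for one image/prompt pair:
#   completions = [model.sample(image, prompt) for _ in range(8)]
#   rewards = [criterion_reward(parse_boxes(c), gt_boxes) for c in completions]
#   advantages = group_relative_advantages(rewards)  # weights the policy update
```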
The Vision-R1 algorithm incorporates several critical technical innovations. First, the criterion-driven reward function includes dual format rewards, recall rewards, and precision rewards. Dual format rewards ensure outputs adhere strictly to template and content constraints, essential for reliable object detection tasks. The recall reward emphasizes the model’s capacity to identify all relevant instances, crucial for avoiding omissions in predictions. The precision reward encourages high-quality bounding box predictions by calculating the average Intersection over Union (IoU) of valid predictions. Furthermore, the progressive rule refinement strategy is inspired by curriculum learning principles, gradually increasing training difficulty through staged progression and differentiation policies, thereby fostering robust and generalized learning.
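To make these reward components concrete, here is a minimal sketch of a criterion-driven reward that combines a format check, a recall term, and a precision term based on average IoU. The weights, the single IoU threshold, and the function names are illustrative assumptions rather than the paper's exact formulation; under the progressive rule refinement strategy, the matching criteria would be tightened in stages as training advances.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def criterion_reward(pred_boxes, gt_boxes, format_ok=True, iou_thresh=0.5,
                     w_format=0.2, w_recall=0.4, w_precision=0.4):
    """Illustrative criterion-driven reward for one localization completion.

    format reward    : 1 if the completion parsed into the expected template
    recall reward    : fraction of ground-truth boxes matched above iou_thresh
    precision reward : mean IoU of the predictions that count as valid matches
    """
    if not format_ok or not pred_boxes or not gt_boxes:
        return 0.0
    # Recall: how many ground-truth objects were found at all.
    gt_best = [max(iou(gt, p) for p in pred_boxes) for gt in gt_boxes]
    recall = sum(v >= iou_thresh for v in gt_best) / len(gt_boxes)
    # Precision: how tight the valid predicted boxes are (average IoU).
    pred_best = [max(iou(p, gt) for gt in gt_boxes) for p in pred_boxes]
    valid = [v for v in pred_best if v >= iou_thresh]
    precision = sum(valid) / len(valid) if valid else 0.0
    return w_format * 1.0 + w_recall * recall + w_precision * precision
```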
Experiments on two state-of-the-art LVLMs, Griffon-G-7B and Qwen2.5-VL-7B, demonstrate the effectiveness of Vision-R1. On in-domain datasets such as MSCOCO and ODINW-13, the method delivers clear gains: it improves Griffon-G-7B's mAP by 2.5% on average across diverse tasks, and it lifts Qwen2.5-VL-7B by 8.9% on COCO object detection, with the trained 7B model even surpassing its larger 72B counterpart. On challenging out-of-domain localization tasks, Vision-R1 consistently outperforms supervised fine-tuning (SFT), demonstrating strong generalization and robustness in complex scenarios.
In conclusion, Vision-R1 introduces an innovative reinforcement learning approach tailored for LVLMs that effectively addresses existing alignment issues without requiring costly annotated datasets or complex reward modeling. Its criterion-driven reward structure and progressive rule refinement strategy not only enhance the accuracy and comprehensiveness of object localization tasks but also significantly improve generalization to unseen scenarios. The successful integration of Vision-R1 with contemporary LVLM architectures highlights its potential to serve as a foundational method, significantly advancing the state-of-the-art in vision-language understanding and practical deployment in real-world applications.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.