
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

March 23, 2025
作者: Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
cs.AI

Abstract

Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining followed by supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance the capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It leverages only curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, even surpassing state-of-the-art models 10x the size.
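
The abstract gives no implementation details; purely as an illustration of how a criterion-driven, multi-dimensional reward with progressive rule refinement might be structured, the Python sketch below aggregates per-criterion scores and tightens pass thresholds as training advances. All identifiers (Criterion, completion_reward, refine_rules), weights, and thresholds are assumptions for this sketch and do not come from the paper.

```python
# Illustrative sketch only: a criterion-driven, multi-dimensional reward with
# progressively tightened rules, loosely following the ideas in the abstract.
# All names, weights, and thresholds here are hypothetical, not from the paper.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """One reward dimension (e.g. output format validity, localization accuracy)."""
    name: str
    score_fn: Callable[[str, Dict], float]  # (completion, vision annotation) -> score in [0, 1]
    weight: float
    init_threshold: float                   # pass threshold at the start of training
    threshold: float = 0.0                  # current pass threshold (updated by refine_rules)


def completion_reward(completion: str, annotation: Dict,
                      criteria: List[Criterion]) -> float:
    """Aggregate multi-dimensional vision feedback into one scalar reward."""
    earned, total_weight = 0.0, 0.0
    for c in criteria:
        score = c.score_fn(completion, annotation)
        # Only dimensions that clear the current rule threshold earn reward,
        # which penalizes partially correct, reward-hacking completions.
        if score >= c.threshold:
            earned += c.weight * score
        total_weight += c.weight
    return earned / total_weight if total_weight > 0 else 0.0


def refine_rules(criteria: List[Criterion], step: int, total_steps: int,
                 max_threshold: float = 0.9) -> None:
    """Progressively raise each criterion's threshold as training advances,
    so the reward rule becomes stricter over time (one possible reading of
    'progressive rule refinement')."""
    progress = min(step / max(total_steps, 1), 1.0)
    for c in criteria:
        c.threshold = c.init_threshold + progress * (max_threshold - c.init_threshold)


# Toy usage: a single IoU-style grounding criterion with a placeholder scorer.
iou_criterion = Criterion(
    name="grounding_iou",
    score_fn=lambda completion, ann: ann.get("iou", 0.0),
    weight=1.0,
    init_threshold=0.3,
)
refine_rules([iou_criterion], step=500, total_steps=1000)
reward = completion_reward("<predicted box>", {"iou": 0.72}, [iou_criterion])
```

This kind of scalar reward could then be plugged into any R1-style policy-optimization loop over sampled completions; the actual Vision-R1 reward design and schedule are specified in the paper itself.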
