
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

May 20, 2025
作者: Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
cs.AI

Abstract
Traditional visual grounding methods primarily target single-image scenarios with simple textual references. Extending them to real-world settings that involve implicit, complex instructions, particularly across multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multimodal contexts. In this work, we address the more practical universal grounding task and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding that enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset annotated with detailed reasoning chains, and use supervised fine-tuning on it to guide the model toward correct reasoning paths. We then apply rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby strengthening its reasoning capabilities. In addition, we identify a difficulty bias arising from the growing prevalence of easy samples as RL training progresses, and propose a difficulty-aware weight adjustment strategy to further improve performance. Experiments demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous best method. The model also generalizes strongly, improving zero-shot performance by an average of 23.4% across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.
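To make the training recipe in the abstract concrete, the sketch below illustrates what a rule-based grounding reward and a difficulty-aware weight adjustment could look like. It is a minimal illustration, assuming an R1-style setup: the reward combines a format check with box IoU against the ground truth, and group-relative (GRPO-style) advantages are rescaled so that harder prompts, i.e. rollout groups with low mean reward, carry more weight. The function names (`rule_based_reward`, `difficulty_weighted_advantages`), the `<think>`/`<answer>` tags, and the specific weighting form are hypothetical, not taken from the paper.

```python
import re
import statistics

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(completion: str, gt_box) -> float:
    """Hypothetical verifiable reward: a small bonus for emitting the
    expected <think>/<answer> format, plus IoU of the first predicted box."""
    format_ok = "<think>" in completion and "<answer>" in completion
    m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]",
                  completion)
    acc = iou([float(g) for g in m.groups()], gt_box) if m else 0.0
    return acc + (0.1 if format_ok else 0.0)

def difficulty_weighted_advantages(group_rewards, alpha=1.0):
    """Group-relative advantages, rescaled so that harder prompts
    (low mean reward within the rollout group) count for more.
    The weighting form 1 + alpha * (1 - mean) is an illustrative guess."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero
    weight = 1.0 + alpha * (1.0 - mu)  # assumes rewards roughly in [0, 1]
    return [weight * (r - mu) / sigma for r in group_rewards]
```

Under this weighting, a rollout group with rewards [0.9, 0.8, 1.0] receives a smaller weight than one with [0.1, 0.0, 0.3], so the gradient signal concentrates on prompts the model still gets wrong, counteracting the drift toward easy samples that the abstract reports.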

