UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
May 20, 2025
Authors: Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
cs.AI
Abstract
Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios involving implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model toward correct reasoning paths via supervised fine-tuning.
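For concreteness, the record below illustrates what a CoT grounding annotation of this kind might look like; the field names, file names, and reasoning text are purely hypothetical, not the paper's actual schema.

```python
# Hypothetical shape of one CoT grounding record; all fields are
# illustrative, not taken from the UniVG-R1 dataset.
cot_record = {
    "images": ["kitchen_1.jpg", "kitchen_2.jpg"],   # multi-image context
    "instruction": "Find the appliance mentioned implicitly: "
                   "the device you would use to reheat leftovers.",
    "reasoning_chain": [
        "The instruction asks for a reheating device, i.e. a microwave.",
        "Image 1 shows a stove and a toaster; neither matches the query.",
        "Image 2 shows a microwave on the counter, matching the query.",
    ],
    "target_image": 1,                               # index into `images`
    "bbox": [112.0, 40.5, 310.0, 198.0],             # (x1, y1, x2, y2)
}
```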
Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities.
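The abstract does not detail the rule-based reward. A common design for grounding RL, and a plausible reading here, combines a format check on the reasoning template with an IoU score between the predicted and ground-truth boxes; the tag names, bonus weight, and function names in this Python sketch are assumptions, not the paper's values.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounding_reward(completion: str, gt_box) -> float:
    """Rule-based reward: format compliance plus localization accuracy.

    Assumes the model reasons inside <think>...</think> and emits its
    final box as [x1, y1, x2, y2] inside <answer>...</answer>; the tags
    and the 0.1 format bonus are illustrative choices.
    """
    fmt = re.fullmatch(r"(?s)\s*<think>.+</think>\s*<answer>.+</answer>\s*",
                       completion)
    reward = 0.1 if fmt else 0.0
    m = re.search(r"<answer>.*?\[([\d.\s,]+)\].*?</answer>", completion, re.S)
    if m:
        try:
            coords = [float(v) for v in m.group(1).split(",") if v.strip()]
        except ValueError:
            return reward
        if len(coords) == 4:
            reward += iou(coords, gt_box)   # accuracy term in [0, 1]
    return reward
```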
In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen performance.
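Likewise, only the name of the difficulty-aware weight adjustment is given. One plausible instantiation, assuming GRPO-style group rollouts with rewards in [0, 1], scales each group's advantages by an estimate of sample difficulty so that easy prompts stop dominating late-stage training; the linear weighting below is illustrative only, not the paper's formula.

```python
import numpy as np

def difficulty_weighted_advantages(group_rewards, alpha=1.0):
    """GRPO-style group advantages, reweighted by estimated difficulty.

    `group_rewards` holds the rewards of all rollouts for one prompt.
    Difficulty is read off the group's mean reward (assuming rewards in
    [0, 1]): prompts the policy already solves are down-weighted so easy
    samples contribute less as RL training progresses.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    advantages = (r - r.mean()) / (r.std() + 1e-8)  # standard GRPO normalization
    difficulty = 1.0 - r.mean()                     # high mean reward -> easy
    return (1.0 + alpha * difficulty) * advantages  # harder -> larger weight

# An easy prompt (most rollouts near-perfect) yields smaller updates
# than a hard one with the same normalized advantages:
print(difficulty_weighted_advantages([0.9, 1.0, 0.8, 1.0]))
print(difficulty_weighted_advantages([0.1, 0.0, 0.3, 0.0]))
```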
Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous best method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.