High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
July 8, 2025
Authors: Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
cs.AI
Abstract
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that robust grounding abilities can emerge in LMMs during RL training, driven only by a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual question answering data with short answers and no grounding annotations, MGPO elicits stronger grounding capabilities than GRPO, yielding a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, Qwen2.5-VL-7B post-trained with MGPO on 21K samples surpasses OpenAI's o1 and GPT-4o on the OOD V* Bench. Code is available at https://github.com/EvolvingLMMs-Lab/MGPO.
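
The abstract describes a rollout loop in which the model may first emit grounding coordinates, the corresponding sub-image is cropped and appended to the conversation as a new turn, and a binary reward is assigned from final-answer correctness. The sketch below illustrates that loop under stated assumptions: the `policy` and `parse_box` callables, the prompt wording, the default turn budget, and the exact-match reward check are hypothetical placeholders for illustration, not the paper's implementation.

```python
# Minimal sketch of the multi-turn grounding rollout described in the abstract.
# Only the overall loop (predict box -> crop sub-image -> answer -> binary reward)
# follows the paper's description; all interfaces here are assumptions.
from typing import Callable, List, Optional, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def multi_turn_grounding_rollout(
    policy: Callable[[List[dict]], str],            # hypothetical LMM interface: messages -> text
    parse_box: Callable[[str], Optional[Box]],      # extracts predicted grounding coordinates, if any
    image: Image.Image,
    question: str,
    ground_truth: str,
    max_turns: int = 2,
) -> Tuple[List[dict], float]:
    """Roll out a multi-turn conversation: the model may emit grounding
    coordinates, the environment crops the predicted region and returns it
    as a new turn, and the model then answers. Reward is binary: 1.0 if the
    final answer matches the label, else 0.0."""
    messages = [{"role": "user", "image": image, "text": question}]
    answer = ""
    for _ in range(max_turns):
        answer = policy(messages)
        messages.append({"role": "assistant", "text": answer})
        box = parse_box(answer)
        if box is None:
            break  # model answered directly without requesting a crop
        # Crop the model-predicted region (coordinates assumed to refer to the
        # original high-resolution image) and feed it back as a new user turn.
        sub_image = image.crop(box)
        messages.append({
            "role": "user",
            "image": sub_image,
            "text": "Here is the cropped region. Answer the original question.",
        })
    reward = 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0
    return messages, reward
```

In actual MGPO training, per the abstract, the policy-gradient loss would be computed only on the model-generated assistant turns across the dialogue rounds, with the environment-inserted crop turns excluded from the loss.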