High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
July 8, 2025
Authors: Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
cs.AI
Abstract
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. In contrast to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training using only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-answering data with short answers and no grounding annotations, MGPO elicits stronger grounding capabilities than GRPO, yielding a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training of Qwen2.5-VL-7B on 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Code is available at https://github.com/EvolvingLMMs-Lab/MGPO.
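
To make the multi-turn grounding mechanism described above more concrete, the following is a minimal Python sketch of one rollout: the model first emits grounding coordinates, the predicted box is cropped from the high-resolution image and fed back as the next user turn, and a binary reward is computed from the final answer alone. This is an illustrative sketch, not the released MGPO implementation; the `policy.generate` call, the bracketed box format parsed by `parse_grounding`, the `Answer:` marker parsed by `parse_answer`, and the two-turn cap are all assumptions made here for readability (see the repository above for the actual code).

```python
# Minimal, illustrative sketch of a multi-turn grounding rollout in the spirit of MGPO.
# NOT the authors' implementation: `policy.generate` and the output formats assumed
# by the two parsers below are hypothetical placeholders.
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

from PIL import Image


@dataclass
class Turn:
    response: str                       # model output for this round; only these tokens
                                        # would enter the policy loss (cold-start fix)
    crop: Optional[Image.Image] = None  # sub-image cropped from predicted coordinates


def parse_grounding(text: str) -> Optional[List[int]]:
    """Extract a box like '[x1, y1, x2, y2]' from the model output (assumed format)."""
    m = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", text)
    return [int(g) for g in m.groups()] if m else None


def parse_answer(text: str) -> str:
    """Take the text after an 'Answer:' marker if present (assumed format)."""
    return text.split("Answer:")[-1].strip()


def rollout(policy, image: Image.Image, question: str,
            max_turns: int = 2) -> Tuple[List[Turn], Optional[str]]:
    """Round 1: the model grounds the question by predicting box coordinates; the box
    is cropped from the original image and appended to the conversation.
    Final round: the model answers the question."""
    turns: List[Turn] = []
    messages = [{"role": "user", "content": question, "image": image}]
    answer: Optional[str] = None
    for t in range(max_turns):
        response = policy.generate(messages)          # hypothetical model call
        box = parse_grounding(response)
        if box is not None and t + 1 < max_turns:
            crop = image.crop(tuple(box))             # zoom into the predicted key region
            turns.append(Turn(response=response, crop=crop))
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "user", "content": "Cropped region:", "image": crop})
        else:
            answer = parse_answer(response)
            turns.append(Turn(response=response))
            break
    return turns, answer


def binary_reward(answer: Optional[str], ground_truth: str) -> float:
    """Reward derived solely from final-answer correctness; no grounding labels needed."""
    return float(answer is not None and
                 answer.strip().lower() == ground_truth.strip().lower())
```

In this sketch, only the `response` fields collected across rounds would be used for policy-gradient updates, mirroring the abstract's point that restricting the loss to model-generated outputs across dialogue rounds stabilizes optimization.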