High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Abstract

State-of-the-art large multi-modal models (LMMs) struggle with high-resolution images because these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Unlike supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training using only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short-answer data without grounding annotations, MGPO elicits stronger grounding capabilities than GRPO, yielding a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Code is available at https://github.com/EvolvingLMMs-Lab/MGPO.
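The three mechanisms named in the abstract (cropping from predicted coordinates, a binary answer-correctness reward, and a policy loss restricted to model outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` pixel box format, the exact-match answer check, and the per-turn token counts are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x1, y1, x2, y2) in pixels


def clamp_box(box: Box, image_size: Tuple[int, int]) -> Box:
    """Order the corners and clamp a predicted box to the image bounds.

    Hypothetical helper: the clamped box would then be passed to an image
    cropping routine to produce the sub-image for the next turn.
    """
    width, height = image_size
    x1, y1, x2, y2 = box
    lo_x, hi_x = sorted((x1, x2))
    lo_y, hi_y = sorted((y1, y2))
    return (
        max(0, min(width, lo_x)),
        max(0, min(height, lo_y)),
        max(0, min(width, hi_x)),
        max(0, min(height, hi_y)),
    )


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: case-insensitive exact match after trimming whitespace.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def loss_mask(turns: List[Tuple[str, int]]) -> List[int]:
    """Build a token-level mask over a multi-turn conversation.

    Each turn is (role, number_of_tokens). Tokens produced by the model
    ("assistant") get 1 and contribute to the policy loss; all other
    tokens (prompts, inserted crops) get 0.
    """
    mask: List[int] = []
    for role, n_tokens in turns:
        mask.extend([1 if role == "assistant" else 0] * n_tokens)
    return mask
```

In this sketch, a rollout would score `binary_reward` once on the final answer, while `loss_mask` ensures that only tokens the model generated in each dialogue round receive gradient.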
