iVGR: Internalizing Visually Grounded Reasoning in MLLMs via Reinforcement Learning

Abstract

Visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm for enhancing fine-grained perception in multimodal large language models (MLLMs), but its efficacy at inference time remains underexplored. In this work, we find empirically that mandating explicit object bounding boxes in visually grounded CoT during inference often degrades performance relative to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that visual localization capability can be internalized into textual CoT, and that mandatory explicit grounding interferes unnecessarily with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capability into the textual reasoning process. We employ a dual-stream training strategy in which a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks while retaining the flexibility to support tool-assisted inference workflows.