GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

Abstract

Recent advances in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet their impact on multimodal LLMs (MLLMs) remains limited. In vision-intensive tasks such as geometric reasoning in particular, MLLMs hallucinate frequently, which leads to inaccurate reasoning. We attribute this to a perceptual bottleneck in MLLMs that caps the benefits of reasoning training. To quantify it, we design the Geo-Perception Question-Answering (GeoPQA) benchmark, which targets basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings in the visual perception of MLLMs, which in turn constrain the RL reward signals needed for effective training. To address this bottleneck, we propose a two-stage RL training framework that first enhances the visual perception of geometric structures and then fosters reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1% compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains such as figure understanding, highlighting the importance of perceptual grounding for effective MLLM reasoning.