GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
September 22, 2025
Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI
Abstract
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
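The two-stage schedule the abstract describes can be pictured as swapping the RL reward function partway through training: stage 1 rewards correct answers to basic perception questions (GeoPQA-style), and stage 2 rewards correct final answers to full geometry problems. The sketch below is purely illustrative; the function names, the exact-match rewards, and the step-based stage switch are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a two-stage RL reward schedule (hypothetical names,
# not the paper's code). Stage 1 trains perception, stage 2 trains reasoning.

def perception_reward(pred: str, gold: str) -> float:
    """Stage-1 reward: exact match on a basic geometric-perception answer."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def reasoning_reward(pred: str, gold: str) -> float:
    """Stage-2 reward: exact match on the final answer to a geometry problem."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def active_reward(step: int, stage1_steps: int):
    """Select the reward function for the current training step."""
    return perception_reward if step < stage1_steps else reasoning_reward

# Toy usage: the first 100 optimizer steps use the perception reward,
# after which training switches to the reasoning reward.
assert active_reward(50, stage1_steps=100) is perception_reward
assert active_reward(150, stage1_steps=100) is reasoning_reward
```

The point of the split is that a model which cannot yet perceive the figure earns near-zero reasoning reward, so stage 1 first makes the perception reward learnable before stage 2 exposes the model to the sparser reasoning signal.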