GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

September 22, 2025
作者: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI

Abstract

Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
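
The abstract describes a two-stage RL recipe: first reward accurate perception of geometric structures, then reward correct reasoning on top of it. The sketch below is only an illustration of that training schedule under assumed interfaces; the `policy.generate`/`policy.update` methods, the reward functions, and the batch sizes are hypothetical stand-ins, not the authors' implementation, and a real run would plug in a policy-optimization trainer (e.g., PPO/GRPO-style) on Qwen2.5-VL-3B-Instruct.

```python
# Minimal sketch of the two-stage RL idea from the abstract.
# All names below (perception_reward, reasoning_reward, the `policy`
# interface) are hypothetical illustrations, not the paper's code.
import random


def perception_reward(prediction: str, gold: str) -> float:
    """Stage 1: reward exact answers to GeoPQA-style perception questions
    (e.g., counting shapes or naming spatial relationships)."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def reasoning_reward(prediction: str, gold: str) -> float:
    """Stage 2: reward correct final answers to geometry reasoning problems."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def rl_step(policy, batch, reward_fn) -> float:
    """One generic RL update: sample a response per example, score it,
    and nudge the policy toward higher-reward responses."""
    rewards = []
    for example in batch:
        response = policy.generate(example["image"], example["question"])
        rewards.append(reward_fn(response, example["answer"]))
    policy.update(batch, rewards)  # in practice, a policy-gradient update
    return sum(rewards) / len(rewards)


def two_stage_training(policy, perception_data, reasoning_data, steps=(100, 100)):
    # Stage 1: ground visual perception of geometric structures first,
    # so later reasoning rewards are not capped by perception errors.
    for _ in range(steps[0]):
        batch = random.sample(perception_data, k=min(8, len(perception_data)))
        rl_step(policy, batch, perception_reward)
    # Stage 2: train reasoning on top of the improved perception.
    for _ in range(steps[1]):
        batch = random.sample(reasoning_data, k=min(8, len(reasoning_data)))
        rl_step(policy, batch, reasoning_reward)
```

The point of the staging, as the abstract argues, is that perception errors otherwise constrain the reward signal during reasoning training; fixing perception first gives the second stage cleaner credit assignment.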