Fine-Grained Preference Optimization Improves Spatial Reasoning in Vision-Language Models

Abstract

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning. fDPO is guided by a spatial reward mechanism that evaluates candidate responses on visual consistency, spatial grounding, and logical coherence. Experimental results show that fDPO achieves an average improvement of 4.1% over standard DPO on spatial quality tasks and a 9.0% gain on spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new state of the art on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy while maintaining competitive performance on general vision-language tasks.
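To make the idea of segment-specific preference granularity concrete, the following is a minimal sketch, not the authors' formulation. The function `dpo_loss` is the standard DPO objective for a single preference pair. The function `segment_dpo_loss` is a hypothetical variant that assumes a response can be split into named segments (here "description" and "reasoning") and that each segment gets its own weight; the abstract does not state how fDPO actually sets or combines these weights, so the names, the weighting rule, and the example numbers are all assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence-level log-probabilities
    under the policy (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin)
    return softplus(-beta * margin)


def segment_dpo_loss(segments, betas):
    """Hypothetical segment-wise variant (an assumption, not the paper's formula).

    segments: dict mapping a segment name to a tuple
        (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected)
        computed over the tokens of that segment only.
    betas: dict mapping the same segment names to per-segment weights.
    """
    weighted_margin = 0.0
    for name, (pc, pr, rc, rr) in segments.items():
        margin = (pc - rc) - (pr - rr)
        weighted_margin += betas[name] * margin
    return softplus(-weighted_margin)


# Illustrative numbers only; they are not taken from the paper.
example_segments = {
    "description": (-2.0, -3.0, -2.5, -2.5),
    "reasoning": (-4.0, -6.0, -5.0, -5.0),
}
uniform = segment_dpo_loss(example_segments, {"description": 0.1, "reasoning": 0.1})
reasoning_heavy = segment_dpo_loss(example_segments, {"description": 0.05, "reasoning": 0.2})
```

With equal weights on every segment, this sketch reduces to the standard DPO loss on the whole response, because the per-segment log-probabilities sum to the sequence-level ones. Unequal weights let one segment type contribute more to the preference margin, which is one plausible reading of "segment-specific preference granularity".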