Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
June 26, 2025
作者: Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
cs.AI
Abstract
Current Vision-Language Models (VLMs) struggle with fine-grained spatial
reasoning, particularly when multi-step logic and precise spatial alignment are
required. In this work, we introduce SpatialReasoner-R1, a vision-language
reasoning model designed to address these limitations. To construct
high-quality supervision for spatial reasoning, we design a Multi-Model Monte
Carlo Tree Search (M3CTS) method that generates diverse, logically consistent
Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose
fine-grained Direct Preference Optimization (fDPO), which introduces
segment-specific preference granularity for descriptive grounding and logical
reasoning, guided by a spatial reward mechanism that evaluates candidate
responses based on visual consistency, spatial grounding, and logical
coherence. Experimental results demonstrate that fDPO achieves an average
improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0%
gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a
new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in
average accuracy, while maintaining competitive performance on general
vision-language tasks.
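
The abstract describes fDPO only at a high level, so below is a minimal PyTorch sketch of how a segment-weighted DPO objective could look, assuming each response has already been split into descriptive-grounding and logical-reasoning segments with per-segment log-probabilities available. The function name `fdpo_loss`, the segment labels, and the beta values are illustrative assumptions rather than details from the paper, and the spatial reward mechanism that selects chosen/rejected pairs is not shown.

```python
import torch
import torch.nn.functional as F

def fdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              betas=None):
    """Segment-weighted preference loss (sketch).

    Each *_logps argument maps a segment type ("description", "reasoning")
    to a tensor of summed token log-probabilities for that segment of the
    response, shape (batch,).
    """
    if betas is None:
        # Hypothetical per-segment preference strengths; the actual values
        # and segmentation scheme are not given in the abstract.
        betas = {"description": 0.1, "reasoning": 0.5}

    margin = 0.0
    for seg, beta in betas.items():
        chosen_ratio = policy_chosen_logps[seg] - ref_chosen_logps[seg]
        rejected_ratio = policy_rejected_logps[seg] - ref_rejected_logps[seg]
        margin = margin + beta * (chosen_ratio - rejected_ratio)

    # Same Bradley-Terry form as standard DPO, applied to the aggregated
    # segment-weighted margin.
    return -F.logsigmoid(margin).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    segs = ("description", "reasoning")
    rand = lambda: {s: torch.randn(4) for s in segs}
    loss = fdpo_loss(rand(), rand(), rand(), rand())
    print(loss.item())
```

Standard DPO would collapse the two segment terms into a single sequence-level log-ratio with one beta; splitting them as above lets descriptive grounding and logical reasoning receive different preference pressure, which is the distinction the abstract highlights.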