Eliciting Complex Spatial Reasoning in Multimodal Large Language Models through Wide-Baseline Matching

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, which makes it a demanding testbed for the spatial reasoning of multimodal large language models (MLLMs) deployed in physical environments. However, systematic evaluation and training frameworks for these capabilities are still missing for current MLLMs. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity that covers indoor, outdoor, and object-centric scenes. Experiments show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a hard subset of 90 samples, human annotators achieve an F1 score of 84.0, while the best existing baseline reaches only 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and structure-from-motion (SfM) reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum and improves WBM training through verifiable rewards, without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while preserving general visual understanding and yielding modest gains on several of those benchmarks.