Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

June 2, 2026
Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen
cs.AI

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a demanding testbed for the spatial reasoning of multimodal large language models (MLLMs) deployed in physical environments. However, systematic evaluation and training frameworks for these capabilities are still lacking for current MLLMs. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity that covers indoor, outdoor, and object-centric scenes, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a hard 90-sample subset, human annotators achieve an F1 score of 84.0, while the best existing baseline reaches only 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and structure-from-motion (SfM) reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards, without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding and yielding modest gains on several benchmarks.
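
The abstract reports F1 scores and refers to "verifiable rewards" but does not define either. As a rough illustration only, the sketch below shows one way an F1 score over point correspondences could be computed and reused as a reward signal. Everything in it is an assumption rather than the paper's definition: the function name `correspondence_f1`, the pixel tolerance `tol_px`, the use of `None` to mark a point that is occluded or not predicted, and the 0-to-1 scale (the abstract reports values on a 0-to-100 scale).

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def correspondence_f1(
    pred: Dict[str, Optional[Point]],
    gt: Dict[str, Optional[Point]],
    tol_px: float = 10.0,  # hypothetical pixel tolerance, not taken from the paper
) -> float:
    """F1 over point matches; None means 'no match' (e.g. occluded in the second view)."""
    # Points the model claims are visible in the second view.
    n_pred = sum(1 for p in pred.values() if p is not None)
    # Points that are actually visible according to the ground truth.
    n_gt = sum(1 for g in gt.values() if g is not None)

    # A true positive is a predicted point within tol_px of a visible ground-truth point.
    tp = 0
    for key, g in gt.items():
        p = pred.get(key)
        if g is None or p is None:
            continue
        if math.dist(p, g) <= tol_px:
            tp += 1

    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gt
    return 2 * precision * recall / (precision + recall)


# Toy example: "a" is matched, "b" is off by 60 px, "c" is occluded but predicted anyway.
gt_points = {"a": (100.0, 100.0), "b": (200.0, 50.0), "c": None}
pred_points = {"a": (103.0, 104.0), "b": (260.0, 50.0), "c": (10.0, 10.0)}
score = correspondence_f1(pred_points, gt_points)
```

Because the score depends only on the predicted coordinates and geometry-derived ground truth, it can be checked automatically, which is the property a verifiable reward needs. Predicting a location for an occluded point lowers precision, so a metric of this shape also penalizes failures of occlusion reasoning.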