Eliciting Complex Spatial Reasoning in Multimodal Large Language Models through Wide-Baseline Matching

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, reasoning about viewpoint changes, fine-grained perception, and occlusion reasoning, which makes it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, systematic frameworks for evaluating and training these capabilities in current MLLMs are lacking. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve an F1 score of 84.0, while the best existing baseline reaches only 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and structure-from-motion (SfM) reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards, without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance, with modest gains on several benchmarks.
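
The abstract mentions two quantities that a short example can make concrete: the viewpoint displacement between two views and an F1 score over point correspondences. The Python sketch below is illustrative only and is not the authors' implementation. It assumes that viewpoint displacement is measured as the relative rotation angle between two camera rotation matrices, and that a predicted point counts as correct when it lies within a pixel tolerance of its ground-truth location. The function names, the 45-degree threshold, and the 10-pixel tolerance are all hypothetical choices made for illustration.

```python
import math


def rotation_angle_deg(R_a, R_b):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of element-wise products of the two matrices.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def is_wide_baseline(R_a, R_b, min_angle_deg=45.0):
    """Hypothetical pair filter: keep pairs whose rotation gap exceeds a threshold."""
    return rotation_angle_deg(R_a, R_b) >= min_angle_deg


def correspondence_f1(predicted, ground_truth, tol_px=10.0):
    """F1 over point matches (assumed scoring rule, not taken from the paper).

    Both arguments map a point id to its (x, y) location in the target image.
    A prediction is correct if it is within tol_px of the ground-truth location.
    """
    if not predicted or not ground_truth:
        return 0.0
    correct = 0
    for point_id, (px, py) in predicted.items():
        if point_id in ground_truth:
            gx, gy = ground_truth[point_id]
            if math.hypot(px - gx, py - gy) <= tol_px:
                correct += 1
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(ground_truth)
    return 2.0 * precision * recall / (precision + recall)


# Example: identity pose versus a 90-degree rotation about the z-axis.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
angle = rotation_angle_deg(IDENTITY, ROT_Z_90)

# Example: one of two predictions is within tolerance; three ground-truth points exist.
pred = {1: (10.0, 10.0), 2: (100.0, 100.0)}
gt = {1: (12.0, 11.0), 2: (50.0, 50.0), 3: (0.0, 0.0)}
score = correspondence_f1(pred, gt)
```

Because a score of this kind is computed directly from geometry rather than from a learned judge, it could serve as a verifiable reward signal in the sense the abstract describes, although the exact reward used by DCRL is not specified here.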