Eliciting Complex Spatial Reasoning in MLLMs via Wide-Baseline Matching

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, reasoning about viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches only 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards, without explicit CoT supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while preserving general visual understanding performance, with modest gains on several benchmarks.
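
To make the notion of a verifiable correspondence score concrete, the following is a minimal sketch of one way an F1 value over predicted point matches could be computed. It is an illustration under stated assumptions, not the metric or reward defined in this work: the function name `correspondence_f1`, the pixel threshold `tau`, and the convention that `None` marks an occluded or unmatched point are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Optional[Tuple[float, float]]  # None = occluded / no match (assumed convention)


def correspondence_f1(preds: List[Point], gts: List[Point], tau: float = 5.0) -> float:
    """F1 over point correspondences in the target view.

    A prediction is a true positive when both the prediction and the
    ground truth are present and lie within `tau` pixels of each other.
    `tau` is a hypothetical threshold chosen for illustration.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")

    n_pred = sum(p is not None for p in preds)  # points the model claims are visible
    n_gt = sum(g is not None for g in gts)      # points that are actually visible

    # If nothing is visible and nothing was predicted, the output is fully correct.
    if n_pred == 0 and n_gt == 0:
        return 1.0

    tp = 0
    for p, g in zip(preds, gts):
        if p is not None and g is not None:
            if math.hypot(p[0] - g[0], p[1] - g[1]) <= tau:
                tp += 1

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: one accurate match, one mislocalized point, one missed point,
# and one point predicted although the ground truth marks it as occluded.
preds = [(10.0, 10.0), (50.0, 50.0), None, (5.0, 5.0)]
gts = [(12.0, 10.0), (80.0, 80.0), (30.0, 30.0), None]
score = correspondence_f1(preds, gts, tau=5.0)
```

Because such a score depends only on geometry-derived ground truth, it can be checked automatically, which is the property that makes a reward verifiable. Whether DCRL uses a thresholded F1, a distance-based reward, or another form is not stated in the abstract.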