Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, reasoning under viewpoint change, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, systematic frameworks for evaluating and training these capabilities in MLLMs are currently lacking. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches only 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards, without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance, with modest gains on several benchmarks.