M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Abstract

Streaming reconstruction from uncalibrated monocular video remains challenging because it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. Coupling 3D foundation models with SLAM frameworks is a promising paradigm, but a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the precision required for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head for fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further improves tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
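For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how ATE RMSE and PSNR are conventionally computed. It is not taken from the M^3 evaluation code. The function names (`ate_rmse`, `psnr`, `relative_reduction`) are hypothetical, and the sketch assumes the estimated trajectory has already been aligned to the ground truth (for monocular SLAM this is typically a similarity alignment) and that pixel intensities lie in [0, 1].

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two trajectories.

    Assumption: both inputs are equally long sequences of (x, y, z)
    positions, and `estimated` is already aligned to `ground_truth`.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((r - t) ** 2 for r, t in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline
```

Under these definitions, a lower ATE RMSE means a more accurate trajectory and a higher PSNR means a rendering closer to the reference image. A percentage such as the reported ATE RMSE reduction corresponds to `relative_reduction` applied to the baseline and proposed errors, while a PSNR gap in dB is a plain difference of two `psnr` values.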
