M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Abstract

Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
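The two headline metrics can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes the estimated trajectory has already been aligned to the ground truth (for monocular systems this is typically a similarity alignment, which is omitted here), and that images are given as flat lists of intensities in [0, 1]. The function names `ate_rmse`, `psnr`, and `relative_reduction` are hypothetical, and the numbers in the usage lines are made up for illustration rather than taken from the paper.

```python
import math


def ate_rmse(est_positions, gt_positions):
    """RMSE of per-frame translational error.

    Assumes both trajectories are lists of (x, y, z) tuples that are
    already time-associated and aligned to the same frame and scale.
    """
    if len(est_positions) != len(gt_positions) or not est_positions:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(est_positions, gt_positions)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat lists of intensities."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Illustrative values only (not results from the paper).
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(ate_rmse(est, gt))
print(psnr([0.5, 0.5], [0.4, 0.6]))
print(relative_reduction(2.0, 0.5))
```

Read this way, a lower ATE RMSE means the estimated camera positions sit closer to the ground truth, and a higher PSNR means the rendered views are closer to the reference images. A relative reduction such as the reported 64.3% is computed against the baseline's error, not as an absolute difference.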
