G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Abstract

We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods, which rely solely on input images, our method leverages auxiliary data that is commonly available in real-world scenarios, such as depth, camera calibration, or camera positions. We propose a lightweight modification to CUT3R: a dedicated encoder for each modality extracts features, which are then fused with the RGB image tokens via zero convolution. This flexible design allows any combination of priors to be integrated seamlessly at inference time. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing that it can effectively exploit available priors while remaining compatible with varying input modalities.
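
The zero-convolution fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each prior has already been encoded into per-token features, and it models the zero convolution as a zero-initialized per-token linear projection (equivalent to a 1x1 convolution over the token grid) whose output is added to the RGB tokens. The class name `ZeroConvFusion`, the modality names, and all dimensions are hypothetical.

```python
import numpy as np


class ZeroConvFusion:
    """Hypothetical fusion layer: one zero-initialized projection per prior modality."""

    def __init__(self, modality_dims, token_dim):
        # Zero-initialized weights and biases (assumption: this stands in for the
        # "zero convolution" in the abstract). Each projection outputs zeros
        # until training updates its parameters.
        self.weights = {
            name: np.zeros((dim, token_dim)) for name, dim in modality_dims.items()
        }
        self.biases = {name: np.zeros(token_dim) for name in modality_dims}

    def __call__(self, rgb_tokens, prior_features=None):
        # Start from the RGB tokens and add the projection of every prior that
        # is actually supplied; missing priors are simply skipped.
        fused = rgb_tokens.copy()
        for name, feats in (prior_features or {}).items():
            fused = fused + feats @ self.weights[name] + self.biases[name]
        return fused


rng = np.random.default_rng(0)
num_tokens, token_dim = 16, 8  # illustrative sizes only
rgb_tokens = rng.normal(size=(num_tokens, token_dim))

fusion = ZeroConvFusion({"depth": 4, "intrinsics": 6, "pose": 5}, token_dim)

# Any subset of priors may be passed; here depth and pose, but no intrinsics.
priors = {
    "depth": rng.normal(size=(num_tokens, 4)),
    "pose": rng.normal(size=(num_tokens, 5)),
}
fused_init = fusion(rgb_tokens, priors)  # equals rgb_tokens at initialization
rgb_only = fusion(rgb_tokens)            # no priors supplied at all
```

Under these assumptions, the fused tokens are identical to the RGB tokens at initialization, so the added branches leave the behavior of the pretrained model unchanged until training moves the projection weights away from zero. Because each modality has its own branch and absent modalities are skipped, the same layer accepts any combination of priors.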