Continuous 3D Perception Model with Persistent State

Abstract

We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: https://cut3r.github.io/
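
The online loop described above (a persistent state updated with each new image, a pointmap read out per input, and all pointmaps accumulated in one shared coordinate frame) can be sketched roughly as follows. This is a toy illustration only and not the CUT3R network or its API: `init_state`, `step`, and `reconstruct_online` are hypothetical names, the state is a plain vector, and the "update" is a simple moving average standing in for the learned transformer update.

```python
import numpy as np

def init_state(dim=8):
    """Hypothetical initial state: a zero vector (an assumption for this sketch)."""
    return np.zeros(dim)

def step(state, image):
    """Toy stand-in for one recurrent update; NOT the actual CUT3R model.

    Takes the current state and one image of shape (H, W, C), and returns
    the updated state plus a pointmap of shape (H, W, 3).
    """
    h, w = image.shape[:2]
    # State update: blend the old state with a summary of the new observation.
    feat = np.full(state.shape, image.mean())
    new_state = 0.9 * state + 0.1 * feat
    # Readout: one 3D point per pixel, expressed in a single shared frame.
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    depth = np.full((h, w), 1.0 + new_state.mean())
    pointmap = np.stack([xs, ys, depth], axis=-1)
    return new_state, pointmap

def reconstruct_online(images):
    """Process images one at a time, accumulating points into one cloud."""
    state = init_state()
    cloud = np.empty((0, 3))
    for image in images:
        state, pointmap = step(state, image)
        # Pointmaps share a coordinate frame, so they can simply be concatenated.
        cloud = np.concatenate([cloud, pointmap.reshape(-1, 3)], axis=0)
    return state, cloud
```

Calling `reconstruct_online` on a list of three 4x5 images would return a state vector of length 8 and a cloud with 60 points, one per input pixel. The point of the sketch is the control flow: the input can be of any length, and the reconstruction grows as each image arrives.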

