GFlow: Recovering 4D World from Monocular Video

Abstract

Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on multi-view video inputs, known camera parameters, or static scenes, assumptions that typically do not hold in in-the-wild scenarios. In this paper, we relax all of these constraints and tackle a highly ambitious but practical task, which we term AnyV4D: we assume that only one monocular video is available, with no camera parameters as input, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that uses only 2D priors (depth and optical flow) to lift a video (3D) to an explicit 4D representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes the camera poses and the dynamics of the 3D Gaussian points based on the 2D priors and the scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes always introduce new content, we also propose a new pixel-wise densification strategy for Gaussian points to integrate new visual content. Moreover, GFlow goes beyond 4D reconstruction: it also enables tracking of any point across frames without prior training, and it segments moving objects from the scene in an unsupervised way. Additionally, the camera pose of each frame can be derived from GFlow, which allows novel views of a video scene to be rendered by changing the camera pose. Because the representation is explicit, scene-level or object-level editing can be carried out readily, underscoring its versatility and power. Visit our project website at: https://littlepure2333.github.io/GFlow
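To make the idea of lifting 2D priors into an explicit 3D representation more concrete, here is a minimal sketch. It is not GFlow's implementation. It assumes a simple pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), and all function names are illustrative. A pixel and its depth prior are back-projected to a 3D point (a candidate Gaussian center), and an optical-flow vector gives the matching pixel in the next frame, which is lifted in the same way.

```python
# Illustrative sketch only: pinhole back-projection of a pixel using a depth
# prior, plus following that pixel into the next frame along optical flow.
# Function names and intrinsics are assumptions, not part of GFlow.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def lift_track(u, v, flow, depth_t, depth_t1, intrinsics):
    """Return the 3D positions of one pixel in frame t and frame t+1.

    flow is the (du, dv) optical-flow vector at (u, v) in frame t;
    depth_t1 is the depth prior sampled at the flowed pixel in frame t+1.
    Both 3D points are expressed in their own frame's camera coordinates.
    """
    p_t = backproject(u, v, depth_t, *intrinsics)
    u1, v1 = u + flow[0], v + flow[1]
    p_t1 = backproject(u1, v1, depth_t1, *intrinsics)
    return p_t, p_t1


if __name__ == "__main__":
    intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
    p_t, p_t1 = lift_track(420.0, 240.0, (10.0, 0.0), 2.0, 2.0, intrinsics)
    print(p_t, p_t1)
```

In this sketch the two 3D points live in the camera coordinates of their respective frames. Separating camera motion from object motion, which the abstract describes as a sequential optimization over still and moving clusters, is deliberately left out.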
