

GFlow: Recovering 4D World from Monocular Video

May 28, 2024
Authors: Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, Xinchao Wang
cs.AI

Abstract

Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multi-view video inputs, known camera parameters, or static scenes, all of which are typically absent under in-the-wild scenarios. In this paper, we relax all these constraints and tackle a highly ambitious but practical task, which we term AnyV4D: we assume only one monocular video is available without any camera parameters as input, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes camera poses and the dynamics of 3D Gaussian points based on 2D priors and scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes always introduce new content, we also propose a new pixel-wise densification strategy for Gaussian points to integrate new visual content. Moreover, GFlow transcends the boundaries of mere 4D reconstruction; it also enables tracking of any points across frames without the need for prior training and segments moving objects from the scene in an unsupervised way. Additionally, the camera pose of each frame can be derived from GFlow, allowing novel views of the video scene to be rendered by changing the camera pose. By employing the explicit representation, we can readily conduct scene-level or object-level editing as desired, underscoring its versatility and power. Visit our project website at: https://littlepure2333.github.io/GFlow
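
The sequential optimization described in the abstract alternates between fitting the camera pose against the still part of the scene and fitting the motion of the dynamic part. The sketch below is a hypothetical, heavily simplified PyTorch illustration of that two-stage per-frame update, supervised only by flow-propagated 2D point targets; all names (`project`, `optimize_frame`, `flow_targets`, `still_mask`) are illustrative rather than the paper's API, and the depth, appearance, and rigidity terms of the actual method are omitted.

```python
import torch


def project(points, pose, K):
    """Project 3D points (N, 3) into pixel coordinates with a pinhole camera.

    pose: 4x4 world-to-camera matrix, K: 3x3 intrinsics.
    """
    ones = torch.ones(points.shape[0], 1, device=points.device)
    homo = torch.cat([points, ones], dim=1)        # (N, 4) homogeneous coords
    cam = (pose @ homo.T).T[:, :3]                 # (N, 3) camera-frame coords
    uv = (K @ cam.T).T                             # (N, 3) before perspective division
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)  # (N, 2) pixel coordinates


def optimize_frame(points, still_mask, pose, K, flow_targets, steps=200, lr=1e-2):
    """Two-stage per-frame update in the spirit of GFlow's sequential optimization:
    (a) fit the camera pose against the still points, then (b) fit the positions of
    the moving points, both supervised by 2D targets propagated by optical flow."""
    # Stage (a): still points are frozen; only the camera pose is updated.
    # (For brevity the 4x4 matrix is optimized directly; a real implementation
    # would use a proper SE(3) parametrization.)
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        uv = project(points[still_mask], pose, K)
        loss = ((uv - flow_targets[still_mask]) ** 2).mean()
        loss.backward()
        opt.step()
    pose = pose.detach()

    # Stage (b): the pose is frozen; moving points chase their flow targets.
    moving = points[~still_mask].clone().requires_grad_(True)
    opt = torch.optim.Adam([moving], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        uv = project(moving, pose, K)
        loss = ((uv - flow_targets[~still_mask]) ** 2).mean()
        loss.backward()
        opt.step()

    new_points = points.clone()
    new_points[~still_mask] = moving.detach()
    return new_points, pose
```

In this simplified view, the still/moving clustering decides which points constrain the camera and which are allowed to move, mirroring the paper's separation of camera motion from scene dynamics; the pixel-wise densification step, which spawns new Gaussian points for newly revealed content, would run after this update and is not shown here.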