GFlow: Recovering 4D World from Monocular Video
May 28, 2024
Authors: Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, Xinchao Wang
cs.AI
Abstract
Reconstructing 4D scenes from video inputs is a crucial yet challenging task.
Conventional methods usually rely on the assumptions of multi-view video
inputs, known camera parameters, or static scenes, all of which are typically
absent under in-the-wild scenarios. In this paper, we relax all these
constraints and tackle a highly ambitious but practical task, which we term
AnyV4D: we assume only a single monocular video is available as input, without
any camera parameters, and we aim to recover the dynamic 4D world alongside the
camera poses. To this end, we introduce GFlow, a new framework that utilizes
only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit
representation, entailing a flow of Gaussian splatting through space and time.
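
To make the lifting step concrete, the sketch below unprojects each pixel of a frame into a 3D Gaussian center using the monocular depth prior. It is only an illustrative reading of the abstract, not the authors' implementation; the pinhole intrinsics fx, fy, cx, cy are hypothetical placeholders, since GFlow does not assume known camera parameters and must recover the poses itself.

```python
import numpy as np

def lift_frame_to_gaussians(frame, depth, fx, fy, cx, cy):
    """Unproject every pixel of one frame into a 3D Gaussian center.

    frame: (H, W, 3) RGB image in [0, 1]
    depth: (H, W) monocular depth prior for the same frame
    fx, fy, cx, cy: hypothetical pinhole intrinsics (placeholders here;
                    GFlow itself does not assume they are given)
    Returns per-Gaussian 3D means and colors.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grid
    z = depth
    x = (u - cx) / fx * z                                # back-project x
    y = (v - cy) / fy * z                                # back-project y
    means = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) Gaussian centers
    colors = frame.reshape(-1, 3)                        # seed colors from the pixels
    return means, colors
```
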
GFlow first clusters the scene into still and moving parts, then applies a
sequential optimization process that optimizes camera poses and the dynamics of
3D Gaussian points based on 2D priors and scene clustering, ensuring fidelity
among neighboring points and smooth movement across frames. Since dynamic
scenes always introduce new content, we also propose a new pixel-wise
densification strategy for Gaussian points to integrate new visual content.
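
One way to picture the pixel-wise densification is to spawn new Gaussians at pixels that the current set fails to reconstruct, placing them in 3D with the depth prior. The snippet below is a hedged sketch under that reading; the photometric error criterion, the threshold, and the intrinsics are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def densify_new_content(rendered, observed, depth, fx, fy, cx, cy, thresh=0.05):
    """Pixel-wise densification sketch: add Gaussians where the current set
    fails to explain the observed frame (e.g. newly revealed content).

    rendered, observed: (H, W, 3) current render and ground-truth frame
    depth: (H, W) depth prior used to place the new points in 3D
    thresh: hypothetical per-pixel error threshold
    Returns 3D means and colors for the newly added Gaussians.
    """
    err = np.abs(rendered - observed).mean(axis=-1)   # per-pixel photometric error
    v, u = np.nonzero(err > thresh)                   # pixels needing new Gaussians
    z = depth[v, u]
    x = (u - cx) / fx * z                             # unproject the selected pixels
    y = (v - cy) / fy * z
    new_means = np.stack([x, y, z], axis=-1)
    new_colors = observed[v, u]                       # seed colors from the frame
    return new_means, new_colors
```
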
Moreover, GFlow transcends the boundaries of mere 4D reconstruction; it also
enables tracking of any points across frames without the need for prior
training and segments moving objects from the scene in an unsupervised way.
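
Because the representation is explicit, tracking falls out of it: each Gaussian center traces a 3D trajectory, and projecting those centers with the per-frame camera poses yields 2D tracks without any learned tracker. The sketch below illustrates that projection under assumed array shapes and the same hypothetical intrinsics as above.

```python
import numpy as np

def track_points(means_per_frame, w2c_per_frame, fx, fy, cx, cy):
    """Project the centers of a fixed set of Gaussians into every frame,
    yielding 2D tracks directly from the explicit representation.

    means_per_frame: (T, N, 3) Gaussian centers per frame (world coordinates)
    w2c_per_frame:   (T, 3, 4) world-to-camera extrinsics recovered per frame
    fx, fy, cx, cy:  hypothetical intrinsics, for illustration only
    Returns (T, N, 2) pixel coordinates of each point in each frame.
    """
    T, N, _ = means_per_frame.shape
    ones = np.ones((T, N, 1))
    homo = np.concatenate([means_per_frame, ones], axis=-1)   # (T, N, 4)
    cam = np.einsum('tij,tnj->tni', w2c_per_frame, homo)      # (T, N, 3) camera coords
    u = fx * cam[..., 0] / cam[..., 2] + cx                   # perspective projection
    v = fy * cam[..., 1] / cam[..., 2] + cy
    return np.stack([u, v], axis=-1)
```
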
Additionally, the camera pose of each frame can be derived from GFlow, allowing
novel views of the video scene to be rendered by changing the camera pose. By
employing the explicit representation, we can readily conduct
scene-level or object-level editing as desired, underscoring its versatility
and power. Visit our project website at: https://littlepure2333.github.io/GFlow
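
Novel view rendering can likewise be pictured as perturbing a recovered camera pose and re-rendering the Gaussians from it. The helper below only builds such a perturbed pose (the Gaussian splatting rasterizer itself is not shown); orbiting about the scene's vertical axis is an illustrative choice, not part of the method.

```python
import numpy as np

def orbit_pose(w2c, angle_deg):
    """Build a novel camera pose by rotating the world about its vertical
    axis before applying the recovered extrinsics, which is equivalent to
    orbiting the camera around the scene origin.

    w2c: (3, 4) world-to-camera extrinsics recovered for one frame
    angle_deg: orbit angle; the returned pose would be fed to a Gaussian
               splatting renderer to produce the novel view.
    """
    a = np.deg2rad(angle_deg)
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    R, t = w2c[:, :3], w2c[:, 3:]
    return np.concatenate([R @ rot_y, t], axis=1)   # rotated extrinsics, same translation
```
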