GFlow: Recovering 4D World from Monocular Video
May 28, 2024
Authors: Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, Xinchao Wang
cs.AI
Abstract
Reconstructing 4D scenes from video inputs is a crucial yet challenging task.
Conventional methods usually rely on the assumptions of multi-view video
inputs, known camera parameters, or static scenes, all of which are typically
absent under in-the-wild scenarios. In this paper, we relax all these
constraints and tackle a highly ambitious but practical task, which we term
AnyV4D: we assume only a single monocular video is available as input, without
any camera parameters, and we aim to recover the dynamic 4D world alongside the
camera poses. To this end, we introduce GFlow, a new framework that utilizes
only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit
representation, entailing a flow of Gaussian splatting through space and time.
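
To make the lifting step concrete, the sketch below unprojects each pixel of a frame into a 3D Gaussian center using the monocular depth prior. It is only an illustrative reading of the abstract, not the authors' implementation; the pinhole intrinsics fx, fy, cx, cy are hypothetical placeholders, since GFlow does not assume known camera parameters and must recover the poses itself.

```python
import numpy as np

def lift_frame_to_gaussians(frame, depth, fx, fy, cx, cy):
    """Unproject every pixel of one frame into a 3D Gaussian center.

    frame: (H, W, 3) RGB image in [0, 1]
    depth: (H, W) monocular depth prior for the same frame
    fx, fy, cx, cy: hypothetical pinhole intrinsics (placeholders here;
                    GFlow itself does not assume they are given)
    Returns per-Gaussian 3D means and colors.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grid
    z = depth
    x = (u - cx) / fx * z                                # back-project x
    y = (v - cy) / fy * z                                # back-project y
    means = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) Gaussian centers
    colors = frame.reshape(-1, 3)                        # seed colors from the pixels
    return means, colors
```
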
GFlow first clusters the scene into still and moving parts, then applies a
sequential optimization process that optimizes camera poses and the dynamics of
3D Gaussian points based on 2D priors and scene clustering, ensuring fidelity
among neighboring points and smooth movement across frames. Since dynamic
scenes always introduce new content, we also propose a new pixel-wise
densification strategy for Gaussian points to integrate new visual content.
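
One way to picture the pixel-wise densification is to spawn new Gaussians at pixels that the current set fails to reconstruct, placing them in 3D with the depth prior. The snippet below is a hedged sketch under that reading; the photometric error criterion, the threshold, and the intrinsics are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def densify_new_content(rendered, observed, depth, fx, fy, cx, cy, thresh=0.05):
    """Pixel-wise densification sketch: add Gaussians where the current set
    fails to explain the observed frame (e.g. newly revealed content).

    rendered, observed: (H, W, 3) current render and ground-truth frame
    depth: (H, W) depth prior used to place the new points in 3D
    thresh: hypothetical per-pixel error threshold
    Returns 3D means and colors for the newly added Gaussians.
    """
    err = np.abs(rendered - observed).mean(axis=-1)   # per-pixel photometric error
    v, u = np.nonzero(err > thresh)                   # pixels needing new Gaussians
    z = depth[v, u]
    x = (u - cx) / fx * z                             # unproject the selected pixels
    y = (v - cy) / fy * z
    new_means = np.stack([x, y, z], axis=-1)
    new_colors = observed[v, u]                       # seed colors from the frame
    return new_means, new_colors
```
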
Moreover, GFlow transcends the boundaries of mere 4D reconstruction; it also
enables tracking of any points across frames without the need for prior
training and segments moving objects from the scene in an unsupervised way.
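
Because the representation is explicit, tracking falls out of it: each Gaussian center traces a 3D trajectory, and projecting those centers with the per-frame camera poses yields 2D tracks without any learned tracker. The sketch below illustrates that projection under assumed array shapes and the same hypothetical intrinsics as above.

```python
import numpy as np

def track_points(means_per_frame, w2c_per_frame, fx, fy, cx, cy):
    """Project the centers of a fixed set of Gaussians into every frame,
    yielding 2D tracks directly from the explicit representation.

    means_per_frame: (T, N, 3) Gaussian centers per frame (world coordinates)
    w2c_per_frame:   (T, 3, 4) world-to-camera extrinsics recovered per frame
    fx, fy, cx, cy:  hypothetical intrinsics, for illustration only
    Returns (T, N, 2) pixel coordinates of each point in each frame.
    """
    T, N, _ = means_per_frame.shape
    ones = np.ones((T, N, 1))
    homo = np.concatenate([means_per_frame, ones], axis=-1)   # (T, N, 4)
    cam = np.einsum('tij,tnj->tni', w2c_per_frame, homo)      # (T, N, 3) camera coords
    u = fx * cam[..., 0] / cam[..., 2] + cx                   # perspective projection
    v = fy * cam[..., 1] / cam[..., 2] + cy
    return np.stack([u, v], axis=-1)
```
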
Additionally, the camera pose of each frame can be derived from GFlow, allowing
novel views of the video scene to be rendered by changing the camera pose. By
employing the explicit representation, we can readily conduct
scene-level or object-level editing as desired, underscoring its versatility
and power. Visit our project website at: https://littlepure2333.github.io/GFlow
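
Novel view rendering can likewise be pictured as perturbing a recovered camera pose and re-rendering the Gaussians from it. The helper below only builds such a perturbed pose (the Gaussian splatting rasterizer itself is not shown); orbiting about the scene's vertical axis is an illustrative choice, not part of the method.

```python
import numpy as np

def orbit_pose(w2c, angle_deg):
    """Build a novel camera pose by rotating the world about its vertical
    axis before applying the recovered extrinsics, which is equivalent to
    orbiting the camera around the scene origin.

    w2c: (3, 4) world-to-camera extrinsics recovered for one frame
    angle_deg: orbit angle; the returned pose would be fed to a Gaussian
               splatting renderer to produce the novel view.
    """
    a = np.deg2rad(angle_deg)
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    R, t = w2c[:, :3], w2c[:, 3:]
    return np.concatenate([R @ rot_y, t], axis=1)   # rotated extrinsics, same translation
```
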