ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs
June 23, 2025
Authors: Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Zhensong Zhang, Gregory Slabaugh, Eduardo Pérez-Pellitero
cs.AI
Abstract
Dynamic Novel View Synthesis aims to generate photorealistic views of moving
subjects from arbitrary viewpoints. This task is particularly challenging when
relying on monocular video, where disentangling structure from motion is
ill-posed and supervision is scarce. We introduce Video Diffusion-Aware
Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages
personalised diffusion models to synthesise a pseudo multi-view supervision
signal for training a Gaussian splatting representation. By conditioning on
scene-specific features, ViDAR recovers fine-grained appearance details while
mitigating artefacts introduced by monocular ambiguity. To address the
spatio-temporal inconsistency of diffusion-based supervision, we propose a
diffusion-aware loss function and a camera pose optimisation strategy that
aligns synthetic views with the underlying scene geometry. Experiments on
DyCheck, a challenging benchmark with extreme viewpoint variation, show that
ViDAR outperforms all state-of-the-art baselines in visual quality and
geometric consistency. We further highlight ViDAR's strong improvement over
baselines on dynamic regions and provide a new benchmark to compare performance
in reconstructing motion-rich parts of the scene. Project page:
https://vidar-4d.github.io
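
The abstract names a diffusion-aware loss that tempers the spatio-temporal inconsistency of pseudo multi-view supervision, but gives no formula. Below is a minimal PyTorch sketch of one plausible reading, assuming real monocular frames supervise at full weight while diffusion-generated pseudo-views are attenuated by a per-pixel confidence. The function name `diffusion_aware_loss`, the `confidence` input, and the `pseudo_weight` scalar are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a diffusion-aware photometric loss: real monocular
# frames supervise at full strength, while diffusion-generated pseudo-views
# are down-weighted by a per-pixel confidence so their spatio-temporal
# inconsistencies do not dominate the Gaussian-splatting optimisation.
# Names and the weighting scheme are assumptions, not the paper's code.
import torch


def diffusion_aware_loss(
    rendered: torch.Tensor,      # (B, 3, H, W) renders from the 4D Gaussian model
    target: torch.Tensor,        # (B, 3, H, W) real frames or diffusion pseudo-views
    is_pseudo: torch.Tensor,     # (B,) bool, True where target is a diffusion sample
    confidence: torch.Tensor,    # (B, 1, H, W) in [0, 1], e.g. from cross-view agreement
    pseudo_weight: float = 0.5,  # global down-weighting of pseudo-view supervision
) -> torch.Tensor:
    """L1 photometric loss with per-pixel confidence weighting on pseudo-views."""
    # Per-pixel L1 error, averaged over the colour channels.
    err = (rendered - target).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Real frames get weight 1; pseudo-views get confidence-scaled weight.
    w = torch.where(
        is_pseudo.view(-1, 1, 1, 1),
        pseudo_weight * confidence,
        torch.ones_like(confidence),
    )
    # Normalise by the total weight so the loss scale is batch-independent.
    return (w * err).sum() / w.sum().clamp_min(1e-8)


if __name__ == "__main__":
    B, H, W = 4, 64, 64
    rendered = torch.rand(B, 3, H, W, requires_grad=True)
    target = torch.rand(B, 3, H, W)
    is_pseudo = torch.tensor([False, True, True, False])
    confidence = torch.rand(B, 1, H, W)
    loss = diffusion_aware_loss(rendered, target, is_pseudo, confidence)
    loss.backward()  # gradients flow back into the rendered views
    print(float(loss))
```

Under the same reading, the camera pose optimisation the abstract mentions could be realised by registering each pseudo-view's pose as a learnable parameter and refining it jointly under this loss; that, too, is an assumption rather than the paper's stated procedure.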