ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction
April 30, 2025
Authors: Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille
cs.AI
Abstract
In recent years, video generation has seen significant advancements. However,
challenges still persist in generating complex motions and interactions. To
address these challenges, we introduce ReVision, a plug-and-play framework that
explicitly integrates parameterized 3D physical knowledge into a pretrained
conditional video generation model, significantly enhancing its ability to
generate high-quality videos with complex motion and interactions.
Specifically, ReVision consists of three stages. First, a video diffusion model
is used to generate a coarse video. Next, we extract a set of 2D and 3D
features from the coarse video to construct a 3D object-centric representation,
which is then refined by our proposed parameterized physical prior model to
produce an accurate 3D motion sequence. Finally, this refined motion sequence
is fed back into the same video diffusion model as additional conditioning,
enabling the generation of motion-consistent videos, even in scenarios
involving complex actions and interactions. We validate the effectiveness of
our approach on Stable Video Diffusion, where ReVision significantly improves
motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even
outperforms a state-of-the-art video generation model with over 13B parameters
on complex video generation by a substantial margin. Our results suggest that,
by incorporating 3D physical knowledge, even a relatively small video diffusion
model can generate complex motions and interactions with greater realism and
controllability, offering a promising solution for physically plausible video
generation.
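The three-stage pipeline described above can be sketched as plain control flow. This is only an illustrative skeleton: every function name below is a hypothetical placeholder standing in for the actual models (the pretrained video diffusion model, the feature extractors, and the parameterized physical prior), none of which are part of a published API.

```python
# Hypothetical sketch of ReVision's three-stage pipeline.
# All function names are placeholders, not the authors' actual code.

def generate_coarse_video(image, prompt, num_frames=4):
    # Stage 1: the pretrained conditional video diffusion model
    # produces a coarse video from the input conditioning.
    return [f"coarse_frame_{i}" for i in range(num_frames)]

def build_object_centric_3d(frames):
    # Stage 2a: extract 2D and 3D features from the coarse video and
    # assemble a 3D object-centric representation (placeholder dict).
    return {"num_frames": len(frames),
            "pose_sequence": list(range(len(frames)))}

def refine_with_physical_prior(representation):
    # Stage 2b: the parameterized physical prior model refines the
    # representation into an accurate 3D motion sequence.
    return list(representation["pose_sequence"])

def generate_final_video(image, prompt, motion_sequence):
    # Stage 3: the SAME diffusion model runs again, now additionally
    # conditioned on the refined motion sequence.
    return [f"final_frame_{i}" for i, _ in enumerate(motion_sequence)]

def revision_pipeline(image, prompt):
    coarse = generate_coarse_video(image, prompt)
    rep = build_object_centric_3d(coarse)
    motion = refine_with_physical_prior(rep)
    return generate_final_video(image, prompt, motion)
```

The key design point this skeleton makes explicit is that no new generator is trained: the diffusion model in stage 3 is the one from stage 1, reused with extra motion conditioning, which is what makes ReVision plug-and-play.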