
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

April 30, 2025
Authors: Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille
cs.AI

Abstract

In recent years, video generation has advanced significantly. However, generating complex motions and interactions remains challenging. To address this, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
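The three-stage control flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the class `ThreeStagePipeline` and the callables `generate`, `extract_3d`, and `refine_motion` are hypothetical names introduced here, and each stands in for a component (the video diffusion model, the 2D/3D feature extraction, the parameterized physical prior model) whose actual interface the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ThreeStagePipeline:
    """Hypothetical outline of a generate / refine / regenerate loop.

    All three callables are placeholders (assumptions, not a real API):
      generate(condition, motion) -> video   # same model used in stages 1 and 3
      extract_3d(video) -> representation    # object-centric 3D representation
      refine_motion(representation) -> motion sequence
    """

    generate: Callable[[Any, Optional[Any]], Any]
    extract_3d: Callable[[Any], Any]
    refine_motion: Callable[[Any], Any]

    def run(self, condition: Any) -> Any:
        # Stage 1: coarse video, generated without any motion conditioning.
        coarse_video = self.generate(condition, None)

        # Stage 2: build a 3D representation from the coarse video,
        # then refine it into a motion sequence.
        representation = self.extract_3d(coarse_video)
        motion = self.refine_motion(representation)

        # Stage 3: call the same generator again, now conditioned on
        # the refined motion sequence.
        return self.generate(condition, motion)


if __name__ == "__main__":
    # Stub components so the control flow can be exercised end to end.
    pipeline = ThreeStagePipeline(
        generate=lambda cond, motion: {"condition": cond, "motion": motion},
        extract_3d=lambda video: {"from_video": video},
        refine_motion=lambda rep: {"refined": rep},
    )
    print(pipeline.run("input image"))
```

The point the sketch tries to capture is that the generator is invoked twice with the same condition, and only the second call receives the refined motion sequence as extra conditioning.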