ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physical Modeling for Complex Motion and Interaction

Abstract

Video generation has advanced significantly in recent years, yet generating complex motions and interactions remains challenging. To address this, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. ReVision consists of three stages. First, a video diffusion model generates a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which our proposed parameterized physical prior model then refines into an accurate 3D motion sequence. Finally, the refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos even in scenarios involving complex actions and interactions. We validate the approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
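
The three-stage flow described in the abstract can be illustrated with a toy sketch. Nothing here is the ReVision implementation: every name (`ThreeStagePipeline`, `toy_generate`, `toy_extract`, `toy_refine`, `jitter`) is a hypothetical placeholder, each "video" is just a list of 3D points, and a moving average stands in for the learned parameterized physical prior. The sketch only shows the control flow, in which the same generator is called twice, the second time conditioned on the refined motion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Toy stand-ins (assumptions): one 3D point per frame represents both
# a "video" and a "motion sequence". Real models operate on pixels/latents.
Video = List[List[float]]
Motion = List[List[float]]


@dataclass
class ThreeStagePipeline:
    """Hypothetical wrapper showing the generate -> refine -> regenerate loop."""
    generate: Callable[[str, Optional[Motion]], Video]  # used in stages 1 and 3
    extract: Callable[[Video], Motion]                  # stage 2: features -> 3D motion
    refine: Callable[[Motion], Motion]                  # stage 2: physical-prior stand-in

    def run(self, prompt: str) -> Video:
        coarse = self.generate(prompt, None)            # stage 1: coarse video
        motion = self.refine(self.extract(coarse))      # stage 2: refined 3D motion
        return self.generate(prompt, motion)            # stage 3: same model, conditioned


def toy_generate(prompt: str, motion: Optional[Motion] = None) -> Video:
    # Conditioned call: frames simply follow the given motion.
    if motion is not None:
        return [list(p) for p in motion]
    # Unconditioned call: a deliberately jittery zig-zag path.
    return [[float(t), 1.0 if t % 2 else -1.0, 0.0] for t in range(6)]


def toy_extract(video: Video) -> Motion:
    # In this toy setting each frame already is a 3D point.
    return [list(frame) for frame in video]


def toy_refine(motion: Motion) -> Motion:
    # 3-point moving average with fixed endpoints; a crude placeholder
    # for a learned physical prior, not a model of real physics.
    out = [list(motion[0])]
    for i in range(1, len(motion) - 1):
        out.append([(motion[i - 1][k] + motion[i][k] + motion[i + 1][k]) / 3.0
                    for k in range(3)])
    out.append(list(motion[-1]))
    return out


def jitter(motion: Motion) -> float:
    # Sum of absolute second differences; lower means smoother motion.
    return sum(abs(motion[i - 1][k] - 2 * motion[i][k] + motion[i + 1][k])
               for i in range(1, len(motion) - 1) for k in range(3))


if __name__ == "__main__":
    pipe = ThreeStagePipeline(toy_generate, toy_extract, toy_refine)
    coarse = toy_generate("a person juggling", None)
    final = pipe.run("a person juggling")
    print("coarse jitter:", jitter(coarse))
    print("final jitter:", jitter(final))
```

Passing the generator in as a single callable mirrors the plug-and-play framing: the refinement stage wraps an existing model rather than replacing it.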