ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Abstract

Video generation has advanced significantly in recent years, but generating complex motions and interactions remains challenging. To address this, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. ReVision consists of three stages. First, a video diffusion model generates a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos even in scenarios involving complex actions and interactions. We validate our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
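
The three-stage control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class ReVisionSketch, the three callables, and the toy functions are all hypothetical names introduced here, frames are reduced to single numbers, and a moving average merely stands in for the parameterized physical prior model. The only point it shows is that the same generator is called twice, the second time conditioned on a refined motion sequence.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Frame = List[float]    # toy stand-in for one video frame
Motion = List[float]   # toy stand-in for a 3D motion sequence


@dataclass
class ReVisionSketch:
    """Hypothetical wrapper; every field is a placeholder, not a real API."""
    generate: Callable[[str, Optional[Motion]], List[Frame]]
    extract_motion: Callable[[List[Frame]], Motion]
    refine_motion: Callable[[Motion], Motion]

    def run(self, condition: str) -> List[Frame]:
        coarse = self.generate(condition, None)    # stage 1: coarse video
        rough = self.extract_motion(coarse)        # stage 2: extract motion
        refined = self.refine_motion(rough)        # stage 2: refine motion
        return self.generate(condition, refined)   # stage 3: same model again


def toy_generate(condition: str, motion: Optional[Motion] = None) -> List[Frame]:
    # Assumption: with no motion conditioning the toy model emits a jittery
    # 1D trajectory; with conditioning it simply follows the given motion.
    if motion is None:
        motion = [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
    return [[x] for x in motion]


def toy_extract(frames: List[Frame]) -> Motion:
    # Read the single position value back out of each toy frame.
    return [frame[0] for frame in frames]


def toy_refine(motion: Motion) -> Motion:
    # Stand-in for the physical prior: a 3-point moving average, with a
    # shorter window at both ends of the sequence.
    n = len(motion)
    refined = []
    for i in range(n):
        window = motion[max(0, i - 1):min(n, i + 2)]
        refined.append(sum(window) / len(window))
    return refined


def roughness(frames: List[Frame]) -> float:
    # Sum of absolute second differences: larger means jerkier motion.
    xs = [frame[0] for frame in frames]
    return sum(abs(xs[i + 1] - 2 * xs[i] + xs[i - 1])
               for i in range(1, len(xs) - 1))


pipeline = ReVisionSketch(toy_generate, toy_extract, toy_refine)
coarse = toy_generate("a person kicks a ball")
final = pipeline.run("a person kicks a ball")
print(roughness(coarse), roughness(final))
```

In this toy setting the second pass yields a smoother trajectory than the first, loosely mirroring the motion-consistency claim. In a real system each placeholder would be replaced by a learned model.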