PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Abstract

Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with control over physical parameters and applied forces. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions, and we incorporate physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
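
To make the idea of attention over 3D point trajectories more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the PhysCtrl architecture: the abstract does not specify the block's internals, so the factorization into a spatial pass (points within one frame) followed by a temporal pass (one point across frames), the identity query/key/value projections, and the names `attend` and `spatiotemporal_block` are all hypothetical. Diffusion, conditioning on physics parameters and forces, and the physics-based training constraints are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity projections (no learned weights), for illustration only.
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out

def spatiotemporal_block(traj):
    # traj[t][n] is the feature vector (here: xyz) of point n at frame t.
    num_frames, num_points = len(traj), len(traj[0])
    # Spatial pass: points in the same frame exchange information.
    spatial = [attend(frame) for frame in traj]
    # Temporal pass: each point attends along its own trajectory.
    out = [[None] * num_points for _ in range(num_frames)]
    for n in range(num_points):
        track = attend([spatial[t][n] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][n] = track[t]
    return out

# Toy input: 3 frames, 2 points, xyz positions.
traj = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.1, 0.0], [1.0, 0.1, 0.0]],
    [[0.0, 0.2, 0.0], [1.0, 0.2, 0.0]],
]
mixed = spatiotemporal_block(traj)
print(len(mixed), len(mixed[0]), len(mixed[0][0]))  # 3 2 3
```

Factorizing the attention this way keeps the cost at roughly O(T·N² + N·T²) for T frames and N points instead of O((T·N)²) for full joint attention; whether PhysCtrl makes the same trade-off is not stated in the abstract.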
