PhyCo: Learning Controllable Physical Priors for Generative Motion

Abstract

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
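As a rough illustration of what "pixel-aligned physical property maps" could look like, the sketch below rasterizes per-object attribute values onto an instance-id image, giving one map per attribute. It is not the paper's implementation: the abstract does not say how the maps are encoded, so the channel order, the `rasterize_property_maps` helper, the background value, and the toy scene are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical channel order. The abstract names these four attributes
# but does not specify how they are encoded or normalized.
CHANNELS = ("friction", "restitution", "deformation", "force")


def rasterize_property_maps(
    instance_ids: Sequence[Sequence[int]],
    properties: Dict[int, Dict[str, float]],
    background: float = 0.0,
) -> List[List[List[float]]]:
    """Return one H x W map per channel, aligned with the instance-id image.

    Pixels whose object id has no entry (or no value for a channel)
    receive the background value.
    """
    maps = []
    for name in CHANNELS:
        channel = [
            [properties.get(obj_id, {}).get(name, background) for obj_id in row]
            for row in instance_ids
        ]
        maps.append(channel)
    return maps


# Toy 2 x 3 scene (assumed): id 0 is background, id 1 a bouncy ball,
# id 2 a slippery floor.
ids = [
    [0, 1, 1],
    [2, 2, 2],
]
props = {
    1: {"friction": 0.6, "restitution": 0.9},
    2: {"friction": 0.05, "restitution": 0.2},
}
maps = rasterize_property_maps(ids, props)
```

In a setup like the one the abstract describes, maps of this kind would be stacked into a conditioning tensor for the ControlNet branch, so that changing a single value (for example, the ball's restitution) changes the conditioning signal at exactly the pixels that object occupies.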

