VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior in video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
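
The training-time idea in the abstract, predicting both pixels and motion from a single learned representation, can be illustrated with a toy joint objective. This is only a minimal sketch under stated assumptions and not the paper's implementation: the two linear heads, the mean-squared-error terms, the `motion_weight` parameter, and all function names are hypothetical choices made for illustration, since the abstract does not specify the loss form or the architecture.

```python
# Toy illustration (assumed, not from the paper): one shared feature vector
# feeds two linear heads, one for appearance and one for motion, and the
# training loss sums a reconstruction term for each.

def linear(weights, features):
    """Apply a linear head: one output per weight row (dot product with features)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(shared, w_appearance, w_motion,
               target_appearance, target_motion, motion_weight=1.0):
    """Appearance loss plus weighted motion loss, both read from `shared`.

    `motion_weight` is a hypothetical balancing factor; the abstract does not
    say how (or whether) the two terms are weighted.
    """
    pred_appearance = linear(w_appearance, shared)
    pred_motion = linear(w_motion, shared)
    return (mse(pred_appearance, target_appearance)
            + motion_weight * mse(pred_motion, target_motion))


if __name__ == "__main__":
    shared = [1.0, 2.0]                       # the single shared representation
    w_appearance = [[1.0, 0.0], [0.0, 1.0]]   # identity head: predicts [1.0, 2.0]
    w_motion = [[1.0, 1.0]]                   # sums the features: predicts [3.0]
    loss = joint_loss(shared, w_appearance, w_motion,
                      target_appearance=[1.0, 2.0],
                      target_motion=[1.0],
                      motion_weight=0.5)
    print(loss)  # appearance error is 0, motion error is 4, so 0.5 * 4 = 2.0
```

Because both heads read the same `shared` vector, an error in the motion term changes the representation that the appearance head also depends on, which is the intuition behind a joint appearance-motion representation.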
