VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Abstract

Recent advances in text-to-video (T2V) diffusion models have enabled high-fidelity, realistic video synthesis. However, current T2V models often struggle to generate physically plausible content because of their limited inherent ability to understand physics accurately. We find that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those of recent video self-supervised learning methods. To this end, we propose VideoREPA, a novel framework that distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, which leverages spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
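To make the idea of aligning token-level relations concrete, the following is a minimal sketch of a relation-matching loss. It is not the TRD loss as defined in the paper: the cosine-similarity relation, the mean absolute difference, and the names `pairwise_cosine` and `token_relation_loss` are all assumptions chosen for illustration. In a video setting the token axis would plausibly cover both spatial positions and frames, which is also an assumption here.

```python
import numpy as np

def pairwise_cosine(tokens):
    """Return the (N, N) cosine-similarity matrix for an (N, D) token array."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-8)  # guard against zero-norm tokens
    return unit @ unit.T

def token_relation_loss(student_tokens, teacher_tokens):
    """Mean absolute gap between student and teacher token-relation matrices.

    Hypothetical illustration only: the relation (cosine similarity) and the
    distance (mean absolute difference) are assumptions, not the paper's
    definition of the TRD loss.
    """
    student_rel = pairwise_cosine(student_tokens)
    teacher_rel = pairwise_cosine(teacher_tokens)
    return float(np.mean(np.abs(student_rel - teacher_rel)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 16))  # 6 tokens with 16-dim teacher features
student = rng.normal(size=(6, 8))   # 6 tokens with 8-dim student features

# Only the (N, N) relation matrices are compared, so the two feature
# dimensions do not need to match.
loss = token_relation_loss(student, teacher)
```

One consequence of comparing relations rather than raw features is that the student is not forced to reproduce the teacher's feature space. In this sketch, any rotation of the teacher features yields zero loss, which gives one intuition for why relation matching can act as a softer constraint than direct feature alignment.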
