VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
May 29, 2025
Authors: Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
cs.AI
Abstract
Recent advancements in text-to-video (T2V) diffusion models have enabled
high-fidelity and realistic video synthesis. However, current T2V models often
struggle to generate physically plausible content due to their limited inherent
ability to accurately understand physics. We found that while the
representations within T2V models possess some capacity for physics
understanding, they lag significantly behind those from recent video
self-supervised learning methods. To this end, we propose a novel framework
called VideoREPA, which distills physics understanding capability from video
understanding foundation models into T2V models by aligning token-level
relations. This closes the physics understanding gap and enables more
physics-plausible generation. Specifically, we introduce the Token Relation
Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft
guidance suitable for finetuning powerful pre-trained T2V models, a critical
departure from prior representation alignment (REPA) methods. To our knowledge,
VideoREPA is the first REPA method designed for finetuning T2V models and
specifically for injecting physical knowledge. Empirical evaluations show that
VideoREPA substantially enhances the physics commonsense of the baseline
CogVideoX, achieving significant improvement on relevant benchmarks and
demonstrating a strong capacity for generating videos consistent with intuitive
physics. More video results are available at https://videorepa.github.io/.
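The abstract describes aligning token-level relations rather than raw features. A minimal sketch of what such a Token Relation Distillation (TRD) loss could look like is given below, assuming pairwise cosine-similarity matrices as the "relations" and a smooth L1 penalty as the "soft guidance"; the paper's exact formulation (including its spatio-temporal alignment) may differ, and the function name and shapes here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def token_relation_distillation_loss(student_tokens: torch.Tensor,
                                     teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a TRD-style loss.

    Instead of forcing student features to match teacher features directly
    (which requires matching feature dimensions and gives rigid guidance),
    this aligns the pairwise token-token relations within each model, so the
    two feature spaces only need to agree on relative structure.

    student_tokens: (B, N, D_s) token features from the T2V diffusion model
    teacher_tokens: (B, N, D_t) token features from a video understanding
                    foundation model (D_s and D_t may differ)
    """
    # L2-normalize so inner products become cosine similarities
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)

    # Pairwise token relation matrices, shape (B, N, N)
    rel_s = s @ s.transpose(-1, -2)
    rel_t = t @ t.transpose(-1, -2)

    # Soft guidance: penalize discrepancies between the relation matrices
    return F.smooth_l1_loss(rel_s, rel_t)
```

Because only the relation matrices are compared, the teacher's representations can have a different dimensionality than the student's, which is what makes this style of alignment convenient for distilling a frozen foundation model into a pre-trained T2V backbone.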