VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior in video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
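The two units described above can be illustrated with a small sketch. The abstract gives neither the exact loss nor the exact guidance formula, so everything below is an assumption made for illustration: the names `joint_loss` and `guided_prediction`, the parameters `motion_weight`, `w_text`, and `w_motion`, the use of mean squared error, and the classifier-free-guidance-style combination are all hypothetical and may differ from what VideoJAM actually does.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_appearance: Sequence[float],
    pred_motion: Sequence[float],
    target_appearance: Sequence[float],
    target_motion: Sequence[float],
    motion_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    """Training-time idea: penalize errors in both pixels and motion.

    Both predictions are assumed to come from one shared representation,
    so the motion term also shapes what that representation encodes.
    """
    appearance_term = mse(pred_appearance, target_appearance)
    motion_term = mse(pred_motion, target_motion)
    return appearance_term + motion_weight * motion_term


def guided_prediction(
    full: Sequence[float],       # conditioned on the prompt and the model's own motion estimate
    no_text: Sequence[float],    # same call with the prompt dropped
    no_motion: Sequence[float],  # same call with the motion signal dropped
    w_text: float,               # hypothetical guidance weight
    w_motion: float,             # hypothetical guidance weight
) -> List[float]:
    """Inference-time idea: push the output away from the prediction made
    without the motion signal (and without the prompt), in the style of
    classifier-free guidance. This form is assumed, not quoted from the paper.
    """
    return [
        f + w_text * (f - nt) + w_motion * (f - nm)
        for f, nt, nm in zip(full, no_text, no_motion)
    ]
```

With `w_motion = 0` the second function reduces to ordinary text guidance, and with both weights at zero it returns the fully conditioned prediction unchanged. That makes the motion term easy to ablate in such a sketch.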