Video Generation with Predictive Latents
May 4, 2026
Authors: Yian Zhao, Feng Wang, Qiushan Guo, Chang Liu, Xiangyang Ji, Jian Zhang, Jie Chen
cs.AI
Abstract
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the "diffusability" of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to simultaneously reconstruct the observed frames and predict the future ones. This design encourages the latent space to encode temporally predictive structure and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
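The core training recipe in the abstract — drop a random suffix of future frames, encode only the past, and have the decoder both reconstruct and predict — can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' released code: the function names (`predictive_reconstruction_split`, `predictive_loss`), the sampling range for the keep ratio, and the weighting factor `lam` are all hypothetical choices, since the abstract does not specify them.

```python
import random

def predictive_reconstruction_split(num_frames, keep_ratio_range=(0.5, 1.0), rng=None):
    """Split a clip of `num_frames` into past observations and dropped
    future frames, per the predictive reconstruction objective:
    the encoder sees only the observed prefix, while the decoder is
    trained to reconstruct it AND predict the dropped suffix.
    `keep_ratio_range` is an assumed hyperparameter, not from the paper."""
    rng = rng or random.Random()
    keep_ratio = rng.uniform(*keep_ratio_range)
    num_observed = max(1, int(num_frames * keep_ratio))
    observed = list(range(num_observed))             # frames fed to the encoder
    future = list(range(num_observed, num_frames))   # frames the decoder must predict
    return observed, future

def predictive_loss(recon_error, pred_error, lam=1.0):
    """Combined objective: reconstruction error on observed frames plus a
    (hypothetically) weighted prediction error on the dropped future frames."""
    return recon_error + lam * pred_error
```

In a training loop one would encode only the frames indexed by `observed`, decode the full clip, and apply `predictive_loss` to the two error terms; the random keep ratio ensures the latents are regularly forced to carry temporally predictive structure rather than only per-frame appearance.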