Video Generation with Predictive Latents
May 4, 2026
Authors: Yian Zhao, Feng Wang, Qiushan Guo, Chang Liu, Xiangyang Ji, Jian Zhang, Jie Chen
cs.AI
Abstract
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structure and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
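The predictive reconstruction objective described above can be illustrated with a minimal sketch. The `encode` and `decode` functions below are hypothetical placeholders (the paper's actual VAE architecture is not specified here); the point is the training signal: only a prefix of past frames is encoded, while the loss supervises the decoder on all frames, so the latents must carry information that predicts the unobserved future.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames):
    # Placeholder encoder: mean-pools each frame to a 1-D latent.
    # Stands in for the video VAE encoder; hypothetical, illustration only.
    return frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)

def decode(latents, total_frames):
    # Placeholder decoder emitting `total_frames` frames from partial latents:
    # observed frames are "reconstructed", missing ones "predicted" by
    # repeating the last latent (a real decoder would learn this mapping).
    last = latents[-1]
    out = list(latents) + [last] * (total_frames - len(latents))
    return np.stack(out)

def predictive_reconstruction_loss(video, keep_ratio=0.5):
    """Sketch of the predictive reconstruction objective: encode only past
    observations, but compute the loss over observed AND future frames."""
    T = video.shape[0]
    k = max(1, int(T * keep_ratio))   # cutoff; sampled randomly in training
    latents = encode(video[:k])       # encode partial past observations only
    recon = decode(latents, T)        # reconstruct observed + predict future
    target = video.reshape(T, -1).mean(axis=1, keepdims=True)
    return float(np.mean((recon - target) ** 2))

video = rng.standard_normal((8, 4, 4))  # 8 frames of 4x4 "pixels"
loss = predictive_reconstruction_loss(video)
```

With `keep_ratio=1.0` the objective reduces to plain reconstruction; smaller ratios force more of the loss onto unobserved frames, which is the mechanism the abstract credits for encoding temporally predictive structure in the latents.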