Video Generation with Predictive Latents

Abstract

Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict the future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
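
The predictive reconstruction objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "discarding future frames" means keeping a random-length prefix of the clip, that both terms are plain mean squared errors, and that they are combined with a hypothetical weight `predict_weight`. The function names, the toy encoder and decoder, and the omission of the usual VAE KL term are all illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy stand-in for an image; real frames would be tensors


def split_past_future(
    video: Sequence[Frame], min_observed: int, rng: random.Random
) -> Tuple[List[Frame], int]:
    """Drop a random-length suffix of future frames (assumed masking scheme)."""
    if not 1 <= min_observed <= len(video):
        raise ValueError("min_observed must be between 1 and len(video)")
    num_observed = rng.randint(min_observed, len(video))  # inclusive bounds
    return [list(f) for f in video[:num_observed]], num_observed


def mse(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error over all values; 0.0 if there is nothing to compare."""
    total, count = 0.0, 0
    for frame_pred, frame_target in zip(pred, target):
        for x, y in zip(frame_pred, frame_target):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0


def predictive_reconstruction_loss(
    video: Sequence[Frame],
    encode: Callable[[List[Frame]], object],
    decode: Callable[[object, int], List[Frame]],
    min_observed: int,
    rng: random.Random,
    predict_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> Tuple[float, int]:
    """Encode only the observed past, then score the decoder on the full clip."""
    observed, k = split_past_future(video, min_observed, rng)
    latent = encode(observed)            # the encoder never sees future frames
    output = decode(latent, len(video))  # the decoder must emit every frame
    if len(output) != len(video):
        raise ValueError("decoder must return one frame per input frame")
    reconstruction = mse(output[:k], video[:k])  # observed frames
    prediction = mse(output[k:], video[k:])      # discarded future frames
    return reconstruction + predict_weight * prediction, k


# Toy stand-ins so the sketch runs end to end (purely hypothetical models).
def toy_encode(observed: List[Frame]) -> List[Frame]:
    """Identity 'latent': simply keep the observed frames."""
    return [list(f) for f in observed]


def toy_decode(latent: List[Frame], num_frames: int) -> List[Frame]:
    """Copy observed frames, then repeat the last one for every future frame."""
    return [list(latent[min(t, len(latent) - 1)]) for t in range(num_frames)]
```

Calling `predictive_reconstruction_loss(video, toy_encode, toy_decode, 1, random.Random(0))` returns the combined loss together with the number of observed frames. With the toy models the reconstruction term is zero and the prediction term measures how poorly "repeat the last frame" forecasts the discarded frames; a trained decoder would be expected to lower that term by exploiting motion information in the latent.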
