Video Generation with Predictive Latents

Abstract

Video variational autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, which improves training efficiency and stability. Existing video VAEs achieve high reconstruction quality, but continuing to optimize reconstruction does not necessarily improve generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by the principles of predictive world modeling, we investigate whether predictive learning can improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only the partial past observations, while training the decoder to reconstruct the observed frames and simultaneously predict the future ones. This design encourages the latent space to encode temporally predictive structure and to build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior video generation performance, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses show that PV-VAE scales favorably, with generative performance improving as VAE training progresses, and that it yields consistent gains on downstream video understanding tasks, indicating a latent space that effectively captures temporal coherence and motion priors.
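
The predictive reconstruction objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss form, how the number of kept frames is sampled, or the model interfaces. The sketch assumes a plain mean squared error over all frames, a uniformly sampled number of observed frames, and hypothetical `encode`/`decode` callables. A toy "hold last frame" model stands in for the real encoder and decoder.

```python
import random
from typing import Callable, List, Sequence

Frame = List[float]  # flattened frame; a stand-in for an image tensor


def sample_num_observed(num_frames: int, rng: random.Random,
                        min_observed: int = 1) -> int:
    """Pick how many leading (past) frames the encoder may see.

    Uniform sampling is an assumption; the abstract only says that
    future frames are discarded at random.
    """
    if not 1 <= min_observed <= num_frames:
        raise ValueError("min_observed must be in [1, num_frames]")
    return rng.randint(min_observed, num_frames)  # inclusive on both ends


def predictive_reconstruction_loss(
    video: Sequence[Frame],
    encode: Callable[[Sequence[Frame]], object],
    decode: Callable[[object, int], List[Frame]],
    num_observed: int,
) -> float:
    """MSE over ALL frames when only the past frames are encoded."""
    past = video[:num_observed]          # future frames are dropped here
    latent = encode(past)                # encoder sees the past only
    output = decode(latent, len(video))  # decoder emits past + future frames
    if len(output) != len(video):
        raise ValueError("decoder must return one frame per target frame")
    total, count = 0.0, 0
    for target, pred in zip(video, output):
        for t, p in zip(target, pred):
            total += (t - p) ** 2
            count += 1
    return total / count


# Toy stand-ins (hypothetical): the "latent" is just the observed frames,
# and the decoder predicts the future by repeating the last observed frame.
def hold_encode(past: Sequence[Frame]) -> List[Frame]:
    return [list(frame) for frame in past]


def hold_decode(latent: List[Frame], num_frames: int) -> List[Frame]:
    future = [list(latent[-1]) for _ in range(num_frames - len(latent))]
    return latent + future


if __name__ == "__main__":
    rng = random.Random(0)
    video = [[0.0], [1.0], [2.0], [3.0]]
    k = sample_num_observed(len(video), rng)
    loss = predictive_reconstruction_loss(video, hold_encode, hold_decode, k)
    print(f"observed {k} of {len(video)} frames, loss = {loss}")
```

In this toy setup the loss is zero when every frame is observed and grows as more future frames must be predicted. In a real model, that gap is the pressure that would push the latent to carry information about upcoming motion. A practical VAE objective would also include terms that the sketch omits, such as a KL regularizer.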
