Adaptive 1D Video Diffusion Autoencoder
February 4, 2026
Authors: Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang, Xihui Liu
cs.AI
Abstract
Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression wastes tokens on simple videos; (2) inflexible CNN architectures prevent variable-length latent modeling; and (3) deterministic decoders struggle to recover appropriate detail from compressed latents. To address these issues, we propose the One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder uses a query-based vision transformer to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent sequence length. The decoder is a pixel-space diffusion transformer that reconstructs videos conditioned on the latents. With a two-stage training strategy, One-DVA matches 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and can therefore achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts introduced by the generation process.
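To make the two encoder-side ideas concrete, below is a minimal PyTorch sketch of (a) a query-based encoder in which learnable query tokens cross-attend to video patch features to yield a 1D latent sequence, and (b) a variable-length dropout that randomly truncates that sequence during training. All names (`QueryEncoder`, `variable_length_dropout`, the dimensions, and the prefix-truncation reading of the dropout mechanism) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of One-DVA's encoder-side ideas; names and the
# prefix-truncation interpretation are assumptions, not the authors' code.
import torch
import torch.nn as nn


class QueryEncoder(nn.Module):
    """Cross-attend learnable queries to spatiotemporal patch features."""

    def __init__(self, dim: int = 512, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N_patches, dim) from a video patch embedder (not shown)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        latents, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.norm(latents)  # (B, num_queries, dim): a 1D latent sequence


def variable_length_dropout(latents: torch.Tensor, min_len: int = 32) -> torch.Tensor:
    """Randomly keep a prefix of the latent sequence during training.

    Training on random lengths lets the decoder reconstruct from latents of
    any length, so the compression ratio can be chosen per video at inference.
    """
    keep = torch.randint(min_len, latents.size(1) + 1, (1,)).item()
    return latents[:, :keep]


# Usage: encode, truncate, then condition a pixel-space diffusion decoder
# (e.g. via cross-attention) on the truncated latents; decoder not shown.
encoder = QueryEncoder()
feats = torch.randn(2, 1024, 512)   # stand-in for video patch features
z = variable_length_dropout(encoder(feats))
print(z.shape)                      # (2, K, 512) with random K in [32, 256]
```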