FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: (1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which comprises a 14B-parameter DIT base model and a 14B-parameter DIT upsampler, achieves performance competitive with popular open-source models while being an order of magnitude faster. This report discusses the model design and training strategies.
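To make the 64×64×4 downsampling ratio concrete, the following minimal sketch computes the size of the latent grid for a video clip. The clip dimensions, the ceiling-division rounding, and the names `DownsampleRatio` and `latent_grid` are illustrative assumptions rather than details from this report; a real autoencoder may pad inputs or treat the first frame differently.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class DownsampleRatio:
    """Hypothetical container for per-axis downsampling factors."""
    height: int
    width: int
    time: int

    @property
    def pixels_per_latent(self) -> int:
        # Number of input pixel positions covered by one latent position.
        return self.height * self.width * self.time


def latent_grid(frames: int, height: int, width: int,
                ratio: DownsampleRatio) -> tuple[int, int, int]:
    """Return the (time, height, width) latent grid size.

    Assumption: each axis is divided by its factor and rounded up,
    which stands in for whatever padding a real encoder applies.
    """
    return (
        ceil(frames / ratio.time),
        ceil(height / ratio.height),
        ceil(width / ratio.width),
    )


# Ratio stated in the abstract: 64 x 64 spatial, 4 temporal.
FSVIDEO_RATIO = DownsampleRatio(height=64, width=64, time=4)

# Illustrative clip size, chosen so every axis divides evenly.
grid = latent_grid(frames=80, height=768, width=1280, ratio=FSVIDEO_RATIO)
print(grid)                             # (20, 12, 20)
print(FSVIDEO_RATIO.pixels_per_latent)  # 16384
```

Under these assumptions, an 80-frame 768×1280 clip maps to 20 × 12 × 20 = 4,800 latent positions, which suggests why a diffusion transformer operating in such a latent space has far fewer positions to process per step.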
