FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
February 2, 2026
Authors: FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang
cs.AI
Abstract
We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: 1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and 3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which consists of a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. This report discusses our model design and training strategies.
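To give a sense of scale for the 64×64×4 downsampling ratio, the following is a minimal sketch of the arithmetic only, not the autoencoder itself. It assumes the ratio reads as 64× along height, 64× along width, and 4× along time, and that inputs are padded up to a multiple of each factor; the function names and the 120-frame, 1024×1024 input are hypothetical values chosen for illustration, and latent channel count is ignored.

```python
import math


def latent_grid(frames, height, width, ratio=(64, 64, 4)):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: ratio = (height factor, width factor, temporal factor),
    with ceil division standing in for padding to a multiple of each factor.
    """
    ratio_h, ratio_w, ratio_t = ratio
    return (
        math.ceil(frames / ratio_t),
        math.ceil(height / ratio_h),
        math.ceil(width / ratio_w),
    )


def compression_factor(ratio=(64, 64, 4)):
    """Number of input pixel positions mapped to one latent position."""
    ratio_h, ratio_w, ratio_t = ratio
    return ratio_h * ratio_w * ratio_t


# Hypothetical input: 120 frames at 1024x1024 resolution.
print(latent_grid(120, 1024, 1024))  # (30, 16, 16)
print(compression_factor())          # 16384
```

Under these assumptions, each latent position covers 16,384 pixel positions, so the diffusion transformer operates on a far shorter token sequence than it would at pixel resolution.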