Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
February 14, 2025
Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
cs.AI
Abstract
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model
with 30B parameters and the ability to generate videos up to 204 frames in
length. A deep compression Variational Autoencoder, Video-VAE, is designed for
video generation tasks, achieving 16x16 spatial and 8x temporal compression
ratios, while maintaining exceptional video reconstruction quality. User
prompts are encoded using two bilingual text encoders to handle both English
and Chinese. A DiT with 3D full attention is trained using Flow Matching and is
employed to denoise input noise into latent frames. A video-based DPO approach,
Video-DPO, is applied to reduce artifacts and improve the visual quality of the
generated videos. We also detail our training strategies and share key
observations and insights. Step-Video-T2V's performance is evaluated on a novel
video generation benchmark, Step-Video-T2V-Eval, demonstrating its
state-of-the-art text-to-video quality when compared with both open-source and
commercial engines. Additionally, we discuss the limitations of the current
diffusion-based model paradigm and outline future directions for video
foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval
available at https://github.com/stepfun-ai/Step-Video-T2V; an online version is
also accessible at https://yuewen.cn/videos. Our goal is to
accelerate the innovation of video foundation models and empower video content
creators.
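To make the stated compression ratios concrete, below is a minimal Python sketch of the latent-shape arithmetic they imply. The input resolution (768x768) and latent channel count (16) are illustrative assumptions, not values taken from the report.

```python
# Minimal sketch of the Video-VAE latent-shape arithmetic implied by the
# abstract's compression ratios: 16x16 spatial and 8x temporal.
# NOTE: the latent channel count (16) and the 768x768 input resolution are
# illustrative assumptions, not values from the report.

def latent_shape(frames: int, height: int, width: int,
                 t_ratio: int = 8, s_ratio: int = 16,
                 latent_channels: int = 16) -> tuple[int, int, int, int]:
    """Return an approximate (T, C, H, W) latent shape for one clip.

    Exact handling of frame counts not divisible by t_ratio depends on the
    VAE's padding / causal design; we floor-divide for illustration.
    """
    return (frames // t_ratio, latent_channels,
            height // s_ratio, width // s_ratio)

t, c, h, w = latent_shape(204, 768, 768)
print((t, c, h, w))   # -> (25, 16, 48, 48)
print(t * h * w)      # ~57,600 latent positions seen by full 3D attention
```

The practical point is the token budget: the 8 x 16 x 16 = 2048x spatiotemporal reduction is what makes full 3D attention over a 204-frame clip tractable for the DiT.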
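The abstract states that the DiT is trained with Flow Matching to denoise input noise into latent frames. For reference, a standard conditional flow matching objective with a linear (rectified-flow) path is sketched below in LaTeX; the report's exact formulation may differ, and the conditioning symbol c (the encoded text prompt) is our notation.

```latex
% Linear path between a latent clip x_0 and Gaussian noise \epsilon:
%   x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad t \sim \mathcal{U}[0, 1]
% The DiT v_\theta regresses the path's constant velocity, conditioned on
% the encoded prompt c:
\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{x_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
  \big\| v_\theta(x_t, t, c) - (\epsilon - x_0) \big\|_2^2
```

At inference, sampling integrates the learned velocity field from t = 1 (pure noise) back to t = 0, turning noise into latent frames that the Video-VAE decoder maps to pixels.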