Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
March 23, 2026
Authors: SII-GAIR, Sand.ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu
cs.AI
Abstract
We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
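The single-stream idea in the abstract (text, video, and audio tokens in one sequence, mixed by self-attention only) can be illustrated with a toy sketch. This is not the released model or its code: the token counts, embedding width, random weights, single attention head, and all function names below are assumptions chosen for illustration, and details such as positional encoding, normalization, and multi-layer stacking are left out.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over one token sequence."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned embeddings or projection weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8  # toy embedding width (assumption)

# Toy per-modality token embeddings; the counts are arbitrary.
text_tokens = random_matrix(3, dim, rng)
video_tokens = random_matrix(5, dim, rng)
audio_tokens = random_matrix(4, dim, rng)

# Single stream: one unified sequence, one shared set of attention weights.
sequence = text_tokens + video_tokens + audio_tokens
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
outputs, weights = self_attention(sequence, w_q, w_k, w_v)

# Split the mixed outputs back into per-modality slices.
n_text, n_video = len(text_tokens), len(video_tokens)
text_out = outputs[:n_text]
video_out = outputs[n_text:n_text + n_video]
audio_out = outputs[n_text + n_video:]
```

Because every row of `weights` spans the whole sequence, each audio token attends to text and video tokens (and vice versa) without a separate cross-attention module, which is the simplification the abstract attributes to the single-stream design.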