Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio with a single-stream Transformer that processes text, video, and audio as one unified token sequence using self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures and remains easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual speech generation in Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, which together allow a 5-second 256p video to be generated in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation over 2000 comparisons, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
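To make the single-stream idea concrete, the following is a minimal sketch and not the daVinci-MagiHuman implementation. It assumes toy random embeddings, hypothetical token counts (3 text, 5 video, 4 audio), one attention head, and no positional encoding, normalization, or feed-forward layers; every name in it is illustrative. It shows only the core point: the three modalities are concatenated into one sequence, and a single self-attention pass lets every token attend to every other token, with no separate streams and no cross-attention.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8  # hypothetical embedding width

# Toy stand-ins for embedded text, video-latent, and audio-latent tokens.
text_tokens = random_matrix(3, DIM, rng)
video_tokens = random_matrix(5, DIM, rng)
audio_tokens = random_matrix(4, DIM, rng)

# Single stream: one unified token sequence for all modalities.
sequence = text_tokens + video_tokens + audio_tokens

# One shared set of projection weights serves every modality.
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))
output, weights = self_attention(sequence, w_q, w_k, w_v)

# Slice the shared output back into per-modality segments.
video_out = output[3:8]
audio_out = output[8:12]
```

Because each row of `weights` spans all 12 positions, a video token's output already mixes information from the text and audio tokens. A multi-stream design would instead need separate per-modality stacks plus cross-attention layers to obtain that interaction.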
