

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

March 23, 2026
Authors: SII-GAIR, Sand. ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu
cs.AI

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
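The core design claim above is that text, video, and audio tokens live in one unified sequence processed by plain self-attention, with no per-modality streams or cross-attention branches. The following is a minimal NumPy sketch of that idea under stated assumptions: the token dimensions, shared projections, and the function name `single_stream_attention` are illustrative and not from the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_stream_attention(text_tokens, video_tokens, audio_tokens, seed=0):
    """One self-attention pass over a unified multimodal token sequence.

    All arrays are (num_tokens, d). Concatenating them first means every
    token attends to every other token, regardless of modality -- the
    'single-stream' alternative to multi-stream or cross-attention designs.
    """
    x = np.concatenate([text_tokens, video_tokens, audio_tokens], axis=0)
    d = x.shape[-1]
    # Hypothetical shared Q/K/V projections: one set of weights for all
    # modalities, since there is only one stream.
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Full joint attention: text, video, and audio tokens mix freely.
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v  # (total_tokens, d), same layout as the input sequence
```

Because the joint sequence is just a longer standard self-attention input, this design runs on unmodified Transformer training and serving stacks, which is the optimization-simplicity argument the abstract makes.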
PDF · March 25, 2026