Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
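
As a rough illustration of the single-stream idea described above, the sketch below concatenates text, video, and audio tokens into one sequence and mixes them with plain self-attention, with no cross-attention or per-modality branches. It is a minimal sketch under stated assumptions, not the daVinci-MagiHuman implementation: the function names, the single attention head, the embedding width, the token counts, and the random weights are all hypothetical and chosen only to keep the example small.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1
    return weights @ v, weights


def single_stream_step(text, video, audio, params):
    """Concatenate all modalities into one sequence, attend once, then split."""
    tokens = np.concatenate([text, video, audio], axis=0)  # unified token sequence
    mixed, weights = self_attention(tokens, *params)
    n_text, n_video = len(text), len(video)
    video_out = mixed[n_text:n_text + n_video]
    audio_out = mixed[n_text + n_video:]
    return video_out, audio_out, weights


# Hypothetical sizes: 4 text, 6 video, and 5 audio tokens in a shared 16-dim space.
rng = np.random.default_rng(0)
dim = 16
text = rng.normal(size=(4, dim))
video = rng.normal(size=(6, dim))
audio = rng.normal(size=(5, dim))
params = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)]

video_out, audio_out, weights = single_stream_step(text, video, audio, params)
print(video_out.shape, audio_out.shape, weights.shape)
```

Because every token attends to every other token in the same sequence, the attention matrix here is 15 x 15: each video and audio token can read text, video, and audio positions through the same set of weights, which is the property the single-stream design relies on.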
