Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
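
The single-stream idea in the abstract (text, video, and audio tokens placed in one sequence and mixed by self-attention only) can be illustrated with a toy sketch. This is not the released implementation: the token values, the embedding width of 4, the single attention head, and the identity query/key/value projections are all assumptions chosen to keep the example short and dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each output token is a
    convex combination of the input tokens. A real model would use
    learned projections, multiple heads, and many stacked layers.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


# Hypothetical toy embeddings for each modality (width 4).
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
video_tokens = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
audio_tokens = [[0.0, 0.0, 0.0, 1.0]]

# Single stream: one concatenated sequence, no cross-attention branches.
sequence = text_tokens + video_tokens + audio_tokens
mixed = self_attention(sequence)

# Split the mixed sequence back into per-modality outputs.
n_text, n_video = len(text_tokens), len(video_tokens)
video_out = mixed[n_text:n_text + n_video]
audio_out = mixed[n_text + n_video:]
```

Because every token attends to every other token in the shared sequence, the audio output already carries information from the text and video tokens without a separate cross-attention module, which is the simplification the abstract attributes to the single-stream design.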
