MoCha: Towards Movie-Grade Talking Character Synthesis

Abstract

Recent advances in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task that generates talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims to generate the full portrait of one or more characters, beyond the facial region. In this paper, we propose MoCha, the first system of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, so that AI-generated characters can engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
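
As a rough illustration of what a windowed attention between speech and video tokens could look like, the following is a minimal sketch and not MoCha's actual implementation. It assumes that video and speech tokens are aligned linearly in time, that each video token attends only to speech tokens within a fixed window around its aligned position, and that a single unprojected attention head suffices for illustration. The names `window_mask` and `windowed_cross_attention` and the `window` parameter are hypothetical.

```python
import numpy as np


def window_mask(num_video, num_speech, window=1):
    """Boolean mask of shape (num_video, num_speech).

    Entry [i, j] is True when video token i may attend to speech token j.
    Assumption: tokens are aligned linearly in time (hypothetical choice).
    """
    # Index of the speech token aligned with the centre of each video token.
    centers = ((np.arange(num_video) + 0.5) * num_speech / num_video).astype(int)
    speech_idx = np.arange(num_speech)
    # Allow attention only within `window` speech tokens of the aligned index.
    return np.abs(speech_idx[None, :] - centers[:, None]) <= window


def windowed_cross_attention(video_q, speech_k, speech_v, window=1):
    """Single-head cross-attention from video queries to speech keys/values,
    restricted by the local window mask. Returns (output, attention weights)."""
    num_video, dim = video_q.shape
    num_speech = speech_k.shape[0]
    mask = window_mask(num_video, num_speech, window)

    scores = video_q @ speech_k.T / np.sqrt(dim)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window positions

    # Numerically stable softmax; every row has at least one allowed position.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ speech_v, weights


# Toy usage: 4 video tokens, 8 speech tokens, feature dimension 16.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
out, attn = windowed_cross_attention(q, k, v, window=1)
```

In this sketch, each video token's attention weights are zero outside its window, so its output depends only on temporally nearby speech, which is the intuition behind using a local window for synchronization.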