MoCha: Towards Movie-Grade Talking Character Synthesis

Abstract

Recent advances in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task that generates talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims to generate the full portrait of one or more characters, beyond the facial region. In this paper, we propose MoCha, the first method of its kind for generating talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, so that AI-generated characters can engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
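The abstract does not spell out how the speech-video window attention is built, so the following is only a minimal sketch of one plausible form of windowed cross-attention masking, not MoCha's actual implementation. It assumes each video token corresponds to a fixed number of speech tokens and may attend only to those tokens plus a small padding on either side. The function name `window_mask` and the parameters `audio_per_video` and `pad` are hypothetical.

```python
def window_mask(num_video_tokens, audio_per_video, pad):
    """Build a boolean mask where mask[i][j] is True if video token i
    may attend to speech token j.

    Assumption (not from the paper's abstract): video token i is aligned
    with speech tokens [i * audio_per_video, (i + 1) * audio_per_video),
    and the window is widened by `pad` tokens on each side.
    """
    num_audio_tokens = num_video_tokens * audio_per_video
    mask = []
    for i in range(num_video_tokens):
        # Clamp the window to the valid speech-token range.
        start = max(0, i * audio_per_video - pad)
        end = min(num_audio_tokens, (i + 1) * audio_per_video + pad)
        mask.append([start <= j < end for j in range(num_audio_tokens)])
    return mask


# Hypothetical sizes chosen only for illustration.
mask = window_mask(num_video_tokens=4, audio_per_video=2, pad=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an attention layer, such a mask would typically be applied to the video-to-speech attention scores before the softmax, so that each video token is influenced only by temporally nearby speech, which is the intuition behind aligning lip and body motion with the audio.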