MoCha: 영화 수준의 대화형 캐릭터 합성을 향하여

초록

최근 비디오 생성 기술의 발전은 인상적인 움직임의 사실감을 달성했지만, 자동화된 영화 및 애니메이션 생성에 있어 중요한 요소인 캐릭터 중심 스토리텔링을 종종 간과해 왔습니다. 본 논문에서는 음성과 텍스트로부터 직접 캐릭터 애니메이션을 생성하는 더 현실적인 과제인 Talking Characters를 소개합니다. Talking Head와 달리, Talking Characters는 얼굴 영역을 넘어 하나 이상의 캐릭터 전체 초상을 생성하는 것을 목표로 합니다. 이 논문에서는 이러한 유형의 첫 번째 모델인 MoCha를 제안합니다. 비디오와 음성 간의 정확한 동기화를 보장하기 위해, 음성과 비디오 토큰을 효과적으로 정렬하는 음성-비디오 윈도우 어텐션 메커니즘을 제안합니다. 대규모 음성 레이블이 달린 비디오 데이터셋의 부족 문제를 해결하기 위해, 음성 레이블과 텍스트 레이블이 달린 비디오 데이터를 모두 활용하는 공동 학습 전략을 도입하여 다양한 캐릭터 동작에 대한 일반화를 크게 개선했습니다. 또한, 캐릭터 태그가 포함된 구조화된 프롬프트 템플릿을 설계하여, 처음으로 턴 기반 대화를 통해 다중 캐릭터 대화가 가능하도록 하여 AI 생성 캐릭터가 시네마틱 일관성을 유지하며 상황 인식 대화를 나눌 수 있게 했습니다. 인간 선호도 연구 및 벤치마크 비교를 포함한 광범위한 정성적 및 정량적 평가를 통해, MoCha가 AI 생성 시네마틱 스토리텔링 분야에서 새로운 기준을 세우며 우수한 사실감, 표현력, 제어 가능성 및 일반화를 달성했음을 입증했습니다.

English

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film, animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue-allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.