Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
September 11, 2025
Authors: Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
cs.AI
Abstract
Recent advances in audio-driven avatar video generation have significantly
enhanced audio-visual realism. However, existing methods treat instruction
conditioning merely as low-level tracking driven by acoustic or visual cues,
without modeling the communicative purpose conveyed by the instructions. This
limitation compromises their narrative coherence and character expressiveness.
To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that
unifies multimodal instruction understanding with photorealistic portrait
generation. Our approach adopts a two-stage pipeline. In the first stage, we
design a multimodal large language model (MLLM) director that produces a
blueprint video conditioned on diverse instruction signals, thereby governing
high-level semantics such as character motion and emotions. In the second
stage, guided by blueprint keyframes, we generate multiple sub-clips in
parallel using a first-last frame strategy. This global-to-local framework
preserves fine-grained details while faithfully encoding the high-level intent
behind multimodal instructions. Our parallel architecture also enables fast and
stable generation of long-duration videos, making it suitable for real-world
applications such as digital human livestreaming and vlogging. To
comprehensively evaluate our method, we construct a benchmark of 375 curated
samples covering diverse instructions and challenging scenarios. Extensive
experiments demonstrate that Kling-Avatar is capable of generating vivid,
fluent, long-duration videos at up to 1080p and 48 fps, achieving superior
performance in lip synchronization accuracy, emotion and dynamic
expressiveness, instruction controllability, identity preservation, and
cross-domain generalization. These results establish Kling-Avatar as a new
benchmark for semantically grounded, high-fidelity audio-driven avatar
synthesis.
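The global-to-local pipeline described above can be sketched in a few lines: consecutive blueprint keyframes become (first, last) frame pairs, and the sub-clips between them are generated in parallel and concatenated in order. The function names (`generate_subclip`, `cascaded_generate`) and the data shapes are hypothetical stand-ins for illustration, not the paper's actual implementation; the real sub-clip generator would be a video diffusion model conditioned on the frame pair and the audio segment.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_subclip(first_frame, last_frame, audio_segment):
    # Hypothetical stand-in for the diffusion-based sub-clip generator:
    # the real model would synthesize frames transitioning from
    # first_frame to last_frame, driven by audio_segment. Here we
    # return a symbolic record so the control flow is runnable.
    return {"start": first_frame, "end": last_frame, "audio": audio_segment}

def cascaded_generate(keyframes, audio_segments):
    """Global-to-local generation (sketch): adjacent blueprint keyframes
    form (first, last) pairs; all sub-clips are produced in parallel,
    then returned in order, ready to concatenate."""
    pairs = list(zip(keyframes[:-1], keyframes[1:]))
    assert len(pairs) == len(audio_segments), "one audio segment per sub-clip"
    jobs = [(f, l, a) for (f, l), a in zip(pairs, audio_segments)]
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so clip boundaries line up
        clips = list(pool.map(lambda args: generate_subclip(*args), jobs))
    return clips

# Example: 4 blueprint keyframes yield 3 sub-clips generated in parallel.
keyframes = ["kf0", "kf1", "kf2", "kf3"]
audio = ["seg0", "seg1", "seg2"]
clips = cascaded_generate(keyframes, audio)
```

Because each sub-clip is anchored at both ends by blueprint keyframes, the clips share exact boundary frames (`clips[i]["end"] == clips[i + 1]["start"]`), which is what lets independent parallel generation still produce a temporally coherent long video.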