Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
September 11, 2025
Authors: Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
cs.AI
Abstract
Recent advances in audio-driven avatar video generation have significantly
enhanced audio-visual realism. However, existing methods treat instruction
conditioning merely as low-level tracking driven by acoustic or visual cues,
without modeling the communicative purpose conveyed by the instructions. This
limitation compromises their narrative coherence and character expressiveness.
To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that
unifies multimodal instruction understanding with photorealistic portrait
generation. Our approach adopts a two-stage pipeline. In the first stage, we
design a multimodal large language model (MLLM) director that produces a
blueprint video conditioned on diverse instruction signals, thereby governing
high-level semantics such as character motion and emotions. In the second
stage, guided by blueprint keyframes, we generate multiple sub-clips in
parallel using a first-last frame strategy. This global-to-local framework
preserves fine-grained details while faithfully encoding the high-level intent
behind multimodal instructions. Our parallel architecture also enables fast and
stable generation of long-duration videos, making it suitable for real-world
applications such as digital human livestreaming and vlogging. To
comprehensively evaluate our method, we construct a benchmark of 375 curated
samples covering diverse instructions and challenging scenarios. Extensive
experiments demonstrate that Kling-Avatar is capable of generating vivid,
fluent, long-duration videos at up to 1080p and 48 fps, achieving superior
performance in lip synchronization accuracy, emotion and dynamic
expressiveness, instruction controllability, identity preservation, and
cross-domain generalization. These results establish Kling-Avatar as a new
standard for semantically grounded, high-fidelity audio-driven avatar
synthesis.
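
To make the cascaded design concrete, the sketch below illustrates the global-to-local flow the abstract describes: stage 1 has an MLLM "director" turn multimodal instructions into blueprint keyframes, and stage 2 fills each interval between consecutive keyframes in parallel, pinning every sub-clip's first and last frames. This is a minimal illustration under assumed interfaces; all names (`mllm_director`, `generate_subclip`, `animate`, `Keyframe`) and data shapes are hypothetical placeholders, not the authors' published API.

```python
# Hypothetical sketch of the two-stage, global-to-local pipeline from the
# abstract. Model calls are stubbed out; only the orchestration is shown.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Keyframe:
    """A blueprint keyframe carrying high-level semantics (motion, emotion)."""
    index: int
    image: bytes  # placeholder for decoded frame data


def mllm_director(audio: bytes, text_prompt: str, ref_image: bytes) -> list[Keyframe]:
    """Stage 1 (assumed interface): an MLLM director converts the multimodal
    instructions into a blueprint video; we return its keyframes."""
    # ... MLLM reasoning + blueprint generation would run here (omitted) ...
    return [Keyframe(i, ref_image) for i in range(5)]  # dummy blueprint


def generate_subclip(first: Keyframe, last: Keyframe, audio_chunk: bytes) -> list[bytes]:
    """Stage 2 worker (assumed interface): synthesize one sub-clip whose first
    and last frames are pinned to consecutive blueprint keyframes
    (the first-last frame strategy)."""
    # ... audio-conditioned video generation would run here (omitted) ...
    return [first.image, last.image]  # dummy frames


def animate(audio: bytes, text_prompt: str, ref_image: bytes) -> list[bytes]:
    """Run the full cascade: blueprint first, then parallel sub-clips."""
    keyframes = mllm_director(audio, text_prompt, ref_image)
    # Split the audio to match keyframe intervals (uniform split for brevity).
    n = len(keyframes) - 1
    chunks = [audio[i * len(audio) // n:(i + 1) * len(audio) // n]
              for i in range(n)]
    # Given their boundary keyframes, sub-clips are independent of each other,
    # so they can be generated in parallel -- the property the abstract credits
    # for fast, stable long-duration synthesis.
    with ThreadPoolExecutor() as pool:
        clips = pool.map(generate_subclip, keyframes[:-1], keyframes[1:], chunks)
    return [frame for clip in clips for frame in clip]


if __name__ == "__main__":
    frames = animate(b"\x00" * 48000, "speak cheerfully", b"<ref portrait>")
    print(f"generated {len(frames)} frames")
```

Note the design choice this sketch highlights: because each sub-clip is conditioned only on its two boundary keyframes and its audio segment, generation time scales with the longest sub-clip rather than the full video length, while the shared blueprint keeps identity and narrative intent consistent across clips.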