KlingAvatar 2.0 Technical Report
December 15, 2025
Authors: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou
cs.AI
Abstract
Avatar video generation models have achieved remarkable progress in recent years. However, prior methods are inefficient at generating long-duration, high-resolution videos and suffer from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both the spatial and temporal dimensions. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting the inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned, long-form, high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
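To make the cascade concrete, below is a minimal, hypothetical Python sketch of the two-stage pipeline the abstract describes: a low-resolution blueprint pass over keyframes, followed by first-last-frame refinement of each adjacent keyframe pair into a high-resolution sub-clip. All names (Frame, generate_blueprint_keyframes, refine_subclip) and the resolutions and clip lengths are illustrative placeholders, not KlingAvatar 2.0's actual models or API.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """A stand-in for a video frame tensor."""
    height: int
    width: int

def generate_blueprint_keyframes(prompt: str, num_keyframes: int) -> list[Frame]:
    # Stage 1 (hypothetical): a low-resolution pass that plans global
    # semantics and motion for the whole video.
    return [Frame(height=256, width=256) for _ in range(num_keyframes)]

def refine_subclip(first: Frame, last: Frame, frames_per_clip: int) -> list[Frame]:
    # Stage 2 (hypothetical): upscale in space and time, conditioned on the
    # sub-clip's first and last keyframes.
    return [Frame(height=1080, width=1920) for _ in range(frames_per_clip)]

def generate_long_video(prompt: str, num_keyframes: int = 8,
                        frames_per_clip: int = 25) -> list[Frame]:
    keyframes = generate_blueprint_keyframes(prompt, num_keyframes)
    video: list[Frame] = []
    for first, last in zip(keyframes, keyframes[1:]):
        clip = refine_subclip(first, last, frames_per_clip)
        # Adjacent sub-clips share a boundary keyframe; drop the duplicate
        # so the stitched video has no repeated frame at each seam.
        video.extend(clip if not video else clip[1:])
    return video

if __name__ == "__main__":
    frames = generate_long_video("a person speaking to camera")
    print(f"generated {len(frames)} high-resolution frames")
```

Sharing the boundary keyframe between adjacent sub-clips is what makes the transitions smooth in this reading: each refined clip ends exactly where the next begins, so long-form videos can be stitched without visible seams.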
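The Co-Reasoning Director can likewise be pictured as an orchestration loop over LLM calls. The schematic below assumes three expert models keyed by modality behind a generic text-in, text-out `LLM` callable; the prompts, the round count, and the choice of the "text" expert as summarizer are all assumptions for illustration, not the paper's actual protocol.

```python
from typing import Callable

# A generic text-in, text-out LLM interface; any backend could implement it.
LLM = Callable[[str], str]

def co_reasoning_director(
    experts: dict[str, LLM],      # modality-specific experts, e.g. text/audio/image
    negative_director: LLM,
    user_inputs: dict[str, str],  # raw user input per modality
    rounds: int = 3,
) -> tuple[str, str]:
    """Run a multi-turn dialogue among modality experts, then derive a
    storyline and a refined negative prompt (schematic sketch only)."""
    transcript = [f"[user/{m}] {text}" for m, text in user_inputs.items()]
    for _ in range(rounds):
        for modality, expert in experts.items():
            # Each expert reads the shared transcript, argues for its
            # modality's priorities, and speculates on latent user intent.
            reply = expert(
                f"You are the {modality} expert. Given the dialogue so far, "
                f"state this modality's priorities and refine the storyline:\n"
                + "\n".join(transcript)
            )
            transcript.append(f"[{modality}] {reply}")
    # Assumption: the text expert doubles as the summarizer of the dialogue.
    storyline = experts["text"](
        "Condense the dialogue into a detailed storyline:\n" + "\n".join(transcript)
    )
    # The Negative Director proposes what the generator should suppress.
    negative_prompt = negative_director(
        "List visual artifacts and behaviors to avoid for this storyline:\n" + storyline
    )
    return storyline, negative_prompt

if __name__ == "__main__":
    echo: LLM = lambda prompt: f"(reply to {len(prompt)}-char prompt)"
    story, neg = co_reasoning_director(
        experts={"text": echo, "audio": echo, "image": echo},
        negative_director=echo,
        user_inputs={"text": "two hosts debate on stage", "audio": "speech.wav"},
    )
    print(story, neg, sep="\n")
```

The returned storyline would condition the blueprint stage above, while the negative prompt steers the generator away from artifacts, which is one plausible way the two components described in the abstract could connect.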