## Kling全模态技术报告
Kling-Omni Technical Report
December 18, 2025
作者: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
cs.AI
摘要
我们推出Kling-Omni——一个通用生成式框架,能够直接从多模态视觉语言输入中合成高保真度视频。该框架采用端到端视角,弥合了多样化视频生成、编辑与智能推理任务之间的功能割裂,将其整合为统一系统。与碎片化的流水线方法不同,Kling-Omni支持包括文本指令、参考图像和视频上下文在内的多样化用户输入,将其处理为统一的多模态表征,最终输出具有电影级画质与高度智能化的视频内容。为支撑这些能力,我们构建了作为多模态视频创作基础的综合数据系统,并通过高效的大规模预训练策略与推理基础设施优化进一步强化框架性能。综合评估表明,Kling-Omni在上下文生成、基于推理的编辑及多模态指令跟随方面展现出卓越能力。我们相信,Kling-Omni不仅是内容创作工具,更是向能够感知、推理、生成并交互动态复杂世界的多模态世界模拟器迈出的关键一步。
English
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.