
Kling-Omni Technical Report

December 18, 2025
Authors: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
cs.AI

Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, and processes them into a unified multimodal representation to deliver cinematic-quality, highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further strengthened by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations show that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. We believe Kling-Omni is more than a content creation tool: it is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.