Kwai Keye-VL Technical Report
July 2, 2025
Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) demonstrate remarkable
capabilities on static images, they often fall short in comprehending dynamic,
information-dense short-form videos, a dominant medium in today's digital
landscape. To bridge this gap, we introduce Kwai Keye-VL, an
8-billion-parameter multimodal foundation model engineered for leading-edge
performance in short-video understanding while maintaining robust
general-purpose vision-language abilities. The development of Keye-VL rests on
two core pillars: a massive, high-quality dataset exceeding 600 billion tokens
with a strong emphasis on video, and an innovative training recipe. This recipe
features a four-stage pre-training process for solid vision-language alignment,
followed by a meticulous two-phase post-training process. The first
post-training stage enhances foundational capabilities like instruction
following, while the second phase focuses on stimulating advanced reasoning. In
this second phase, a key innovation is our five-mode "cold-start" data
mixture, which includes "thinking", "non-thinking", "auto-think", "think
with image", and high-quality video data. This mixture teaches the model to
decide when and how to reason. Subsequent reinforcement learning (RL) and
alignment steps further enhance these reasoning capabilities and correct
abnormal model behaviors, such as repetitive outputs. To validate our approach,
we conduct extensive evaluations, showing that Keye-VL achieves
state-of-the-art results on public video benchmarks and remains highly
competitive on general image-based tasks (Figure 1). Furthermore, we develop
and release KC-MMBench, a new benchmark tailored for real-world short-video
scenarios, where Keye-VL shows a significant advantage.
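To make the five-mode "cold-start" mixture concrete, the following is a minimal, hypothetical sketch of how mode-tagged data could be sampled during cold-start supervised training. Only the five mode names come from the report; the data layout, field names, sampling weights, and the sample_cold_start_batch helper are illustrative assumptions, not Keye-VL's actual pipeline.

```python
# Illustrative sketch only: mixing five "cold-start" data modes.
# Mode names follow the report; everything else is an assumption.
import random

COLD_START_MODES = [
    "thinking",          # explicit reasoning trace before the answer
    "non_thinking",      # direct answer, no reasoning trace
    "auto_think",        # model decides whether to emit a trace
    "think_with_image",  # reasoning grounded in visual evidence
    "video",             # high-quality video instruction data
]

def sample_cold_start_batch(datasets, weights, batch_size=32, seed=0):
    """Draw a mixed batch, one example per draw, by per-mode weight.

    datasets: dict mapping mode name -> list of example dicts
    weights:  dict mapping mode name -> sampling probability
    """
    assert set(weights) <= set(COLD_START_MODES)
    rng = random.Random(seed)
    modes = list(weights)
    probs = [weights[m] for m in modes]
    batch = []
    for _ in range(batch_size):
        mode = rng.choices(modes, probs)[0]
        example = rng.choice(datasets[mode])
        # Tag each example with its mode so the loss/formatting logic
        # knows whether a reasoning trace is expected.
        batch.append({"mode": mode, **example})
    return batch
```

Under this sketch, every drawn example carries its mode tag, giving the model explicit supervision for when a reasoning trace is required ("thinking"), suppressed ("non_thinking"), or left to its own judgment ("auto_think").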