Kwai Keye-VL 1.5 Technical Report
September 1, 2025
Authors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
cs.AI
Abstract
In recent years, the development of Large Language Models (LLMs) has
significantly advanced, extending their capabilities to multimodal tasks
through Multimodal Large Language Models (MLLMs). However, video understanding
remains a challenging area due to the dynamic and information-dense nature of
videos. Existing models struggle with the trade-off between spatial resolution
and temporal coverage when processing video content. We present Keye-VL-1.5,
which addresses fundamental challenges in video comprehension through three key
innovations. First, we introduce a novel Slow-Fast video encoding strategy that
dynamically allocates computational resources based on inter-frame similarity,
processing key frames with significant visual changes at higher resolution
(Slow pathway) while handling relatively static frames with increased temporal
coverage at lower resolution (Fast pathway). Second, we implement a progressive
four-stage pre-training methodology that systematically extends the model's
context length from 8K to 128K tokens, enabling processing of longer videos and
more complex visual content. Third, we develop a comprehensive post-training
pipeline focusing on reasoning enhancement and human preference alignment,
incorporating a 5-step chain-of-thought data construction process, iterative
GSPO-based reinforcement learning with progressive prompt hinting for difficult
cases, and alignment training. Through extensive evaluation on public
benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates
significant improvements over existing models, particularly excelling in video
understanding tasks while maintaining competitive performance on general
multimodal benchmarks.
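The Slow-Fast encoding idea described above — routing frames with significant visual change to a high-resolution Slow pathway and near-static frames to a low-resolution Fast pathway based on inter-frame similarity — can be sketched as a simple routing loop. This is a minimal illustrative sketch, not the paper's actual implementation: the cosine similarity metric, the `sim_threshold` value, and the function name are all assumptions for illustration.

```python
import numpy as np

def slow_fast_route(frames, sim_threshold=0.9):
    """Toy sketch of Slow-Fast frame routing.

    Frames that differ strongly from the previous Slow-pathway frame
    go to the Slow list (to be processed at high resolution); frames
    similar to it go to the Fast list (low resolution, wide temporal
    coverage). The similarity metric and threshold are illustrative
    assumptions, not the paper's exact formulation.
    """
    slow, fast = [], []
    prev = None
    for i, frame in enumerate(frames):
        if prev is None:
            # Always keep the first frame on the Slow pathway.
            slow.append(i)
            prev = frame
            continue
        # Cosine similarity between flattened frames, as a stand-in
        # for whatever inter-frame similarity the model actually uses.
        a = frame.ravel().astype(float)
        b = prev.ravel().astype(float)
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < sim_threshold:
            slow.append(i)   # significant visual change: high-res path
            prev = frame
        else:
            fast.append(i)   # near-static frame: low-res path
    return slow, fast
```

Under this scheme, compute scales with visual change rather than raw frame count: a mostly static video spends most frames on the cheap Fast pathway, preserving the token budget for the frames that actually carry new information.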