Kwai Keye-VL 1.5 Technical Report
September 1, 2025
Authors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have advanced significantly, and
Multimodal Large Language Models (MLLMs) have extended their capabilities to
multimodal tasks. However, video understanding
remains a challenging area due to the dynamic and information-dense nature of
videos. Existing models struggle with the trade-off between spatial resolution
and temporal coverage when processing video content. We present Keye-VL-1.5,
which addresses fundamental challenges in video comprehension through three key
innovations. First, we introduce a novel Slow-Fast video encoding strategy that
dynamically allocates computational resources based on inter-frame similarity,
processing key frames with significant visual changes at higher resolution
(Slow pathway) while handling relatively static frames with increased temporal
coverage at lower resolution (Fast pathway). Second, we implement a progressive
four-stage pre-training methodology that systematically extends the model's
context length from 8K to 128K tokens, enabling processing of longer videos and
more complex visual content. Third, we develop a comprehensive post-training
pipeline focusing on reasoning enhancement and human preference alignment,
incorporating a 5-step chain-of-thought data construction process, iterative
GSPO-based reinforcement learning with progressive prompt hinting for difficult
cases, and alignment training. Through extensive evaluation on public
benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates
significant improvements over existing models, particularly excelling in video
understanding tasks while maintaining competitive performance on general
multimodal benchmarks.
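
To make the Slow-Fast encoding idea concrete, here is a minimal sketch of routing frames by inter-frame similarity. The cosine similarity on raw pixels and the threshold value are illustrative assumptions, not the report's exact metric or algorithm:

```python
import numpy as np

def slow_fast_partition(frames, sim_threshold=0.9):
    """Assign each frame index to the Slow (high-res) or Fast (low-res) pathway.

    A frame whose similarity to the most recent key frame falls below
    `sim_threshold` is treated as a significant visual change and routed to
    the Slow pathway; near-static frames go to the Fast pathway, trading
    resolution for temporal coverage.
    """
    slow, fast = [], []
    prev_key = None
    for idx, frame in enumerate(frames):
        if prev_key is None:
            slow.append(idx)  # always keep the first frame as a key frame
            prev_key = frame
            continue
        a = frame.astype(np.float32).ravel()
        b = prev_key.astype(np.float32).ravel()
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < sim_threshold:   # significant change -> Slow pathway
            slow.append(idx)
            prev_key = frame
        else:                     # relatively static -> Fast pathway
            fast.append(idx)
    return slow, fast
```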
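The four-stage pre-training schedule can be pictured as a staged context-extension config. The abstract only states that the context grows from 8K to 128K tokens over four stages; the intermediate lengths below are illustrative assumptions:

```python
# Hypothetical stage schedule mirroring the reported 8K -> 128K extension.
PRETRAIN_STAGES = [
    {"stage": 1, "max_context_tokens": 8_192},    # base context
    {"stage": 2, "max_context_tokens": 32_768},   # assumed intermediate step
    {"stage": 3, "max_context_tokens": 65_536},   # assumed intermediate step
    {"stage": 4, "max_context_tokens": 131_072},  # long-video context
]

for cfg in PRETRAIN_STAGES:
    print(f"stage {cfg['stage']}: sequences up to {cfg['max_context_tokens']} tokens")
```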
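For the reinforcement-learning stage, the sketch below shows a GSPO-style group-relative objective with a length-normalized sequence-level importance ratio and PPO-style clipping. This is a generic sketch of that family of objectives, not the report's training recipe; the progressive prompt hinting and reward design are omitted:

```python
import torch

def gspo_style_loss(logp_new, logp_old, lengths, rewards, clip_eps=0.2):
    """Group-relative policy objective over G sampled responses to one prompt.

    logp_new, logp_old: (G,) summed log-probs of each response sequence
    lengths: (G,) token counts, used to length-normalize the ratio
    rewards: (G,) scalar rewards for the group
    """
    # group-normalized advantages (illustrative; epsilon avoids divide-by-zero)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # sequence-level, length-normalized importance ratio
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.mean(torch.min(ratio * adv, clipped * adv))
```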