Kwai Keye-VL-2.0 Technical Report

June 9, 2026
Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang
cs.AI

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-scale videos bring ultra-long contexts, heavy information redundancy, and prohibitive computational cost. To address these challenges, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly improve throughput and reduce computational overhead. To counter catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, with particularly strong results in fine-grained temporal localization on TimeLens and in long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
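
To make the sparse-attention idea concrete, the following is a minimal pure-Python sketch of generic top-k sparse attention for a single query. It is an illustration under stated assumptions, not the DSA kernels described in the report: the function names are hypothetical, and the `index_scores` argument stands in for whatever relevance scores a token-selection module would produce. The point it shows is that attention is computed over only the `k` selected positions, so the per-query cost scales with `k` rather than with the full context length.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k positions with the highest index score.

    index_scores is an assumed input: one relevance score per key position,
    supplied by some external (hypothetical) selection module.
    Returns the attention output and the selected positions.
    """
    k = min(k, len(keys))
    # Keep the k highest-scoring positions, then restore temporal order.
    selected = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    selected.sort()
    # Standard scaled dot-product attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]
    return output, selected


def dense_attention(query, keys, values):
    """Full attention, expressed as the special case where every position is selected."""
    n = len(keys)
    output, _ = topk_sparse_attention(query, keys, values, [0.0] * n, n)
    return output
```

When `k` equals the number of keys, the sketch reduces to ordinary dense attention, which gives a simple way to check a sparse implementation against a dense reference.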