Kwai Keye-VL-2.0 Technical Report
June 9, 2026
Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang
cs.AI
Abstract
We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-level videos pose three challenges: ultra-long contexts, information redundancy, and prohibitive computational cost. To address them, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K-context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels, which together significantly increase throughput and minimize computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios, with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
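The abstract does not describe how DSA is implemented. As a rough illustration of the general top-k sparse attention idea that such methods build on (a cheap scorer picks a small set of keys per query, and softmax attention runs only over those keys), the following is a minimal pure-Python sketch. The function name `topk_sparse_attention`, the precomputed `index_scores` input, and the parameter `k` are hypothetical and do not come from the report; the sketch is single-head and omits GQA, the learned indexer, and the custom kernels the abstract mentions.

```python
import math


def topk_sparse_attention(queries, keys, values, index_scores, k):
    """Causal attention where each query attends only to its k best-scored keys.

    queries, keys: lists of equal-length float vectors, one per token.
    values: list of float vectors, one per token.
    index_scores[i][j]: hypothetical cheap relevance score of key j for query i
        (an assumption for illustration; not the report's indexer).
    k: number of keys each query is allowed to attend to.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: query i may only see keys 0..i.
        visible = range(i + 1)
        # Keep the k visible keys with the highest index score.
        selected = sorted(
            visible, key=lambda j: index_scores[i][j], reverse=True
        )[:k]
        # Scaled dot-product logits over the selected keys only.
        logits = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in selected
        ]
        # Numerically stable softmax over the selected keys.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        # Weighted sum of the selected value vectors.
        out = [0.0] * len(values[0])
        for w, j in zip(weights, selected):
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs
```

With `k` at least as large as the sequence length, this reduces to ordinary dense causal attention. With a small `k`, the per-query attention cost depends on `k` rather than on the full context length, which is why this family of methods is attractive for very long inputs such as hour-level video.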