Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the ultra-long contexts, information redundancy, and prohibitive computational cost inherent in hour-long videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to a multimodal architecture based on grouped-query attention (GQA), enabling lossless 256K-context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels, which substantially increases throughput and minimizes computational overhead. Furthermore, to mitigate catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios, with multimodal self-correction. Extensive evaluations on video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and at long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
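
To make "dense token-level teacher feedback from on-policy rollouts" more concrete, the following is a minimal sketch of one common way such a signal can be computed. It assumes the feedback is a per-token reverse KL divergence between the student's and a teacher's next-token distributions, evaluated at the positions of a sequence sampled by the student itself. The abstract does not specify the divergence, the loss weighting, or how samples are routed to the different teachers, so the function names (`log_softmax`, `reverse_kl`, `token_level_distill_loss`) and the choice of reverse KL are illustrative assumptions, not the report's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used as the per-token feedback signal;
    the source abstract does not state which divergence is used.
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def token_level_distill_loss(student_logits_seq, teacher_logits_seq):
    """Per-token losses and their mean over one student-sampled rollout.

    Both arguments hold one logit list per generated token; the teacher
    logits are obtained by scoring the student's own rollout.
    """
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return per_token, sum(per_token) / len(per_token)


# Hypothetical two-token rollout over a three-word vocabulary.
student = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
teacher = [[2.0, 0.0, -1.0], [3.0, 0.0, 0.0]]
per_token, mean_loss = token_level_distill_loss(student, teacher)
print(per_token, mean_loss)
```

In this toy rollout the first position contributes zero loss because student and teacher agree exactly, while the second position is penalized because the teacher concentrates probability on one token and the student is uniform. Every generated token receives its own signal in this way, which is what distinguishes dense feedback from a single sequence-level reward.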