Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-long videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to multimodal architectures based on grouped-query attention (GQA), enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that substantially increase throughput and reduce computational overhead. Furthermore, to mitigate catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and at long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
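
To make "dense token-level teacher feedback from on-policy rollouts" concrete, the following is a minimal sketch and not the MOPD objective itself, which the abstract does not specify. It assumes a per-token reverse KL divergence, a common choice for on-policy distillation, computed on tokens sampled from the student and scored by a single teacher. The function names (`log_softmax`, `token_reverse_kl`, `distillation_loss`) and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed loss form)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def distillation_loss(student_rollout, teacher_rollout):
    """Mean per-token reverse KL over one rollout sampled from the student.

    Each argument holds one logits vector per generated token. The teacher
    logits are assumed to come from scoring the student's own sampled tokens,
    which is what makes the signal on-policy and dense (one value per token).
    """
    per_token = [
        token_reverse_kl(s, t) for s, t in zip(student_rollout, teacher_rollout)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy rollout: 3 generated tokens, vocabulary of 4 (illustrative numbers only).
student = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]
teacher = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, 0.0, 0.0], [0.5, 1.5, -2.0, 0.3]]
loss, per_token = distillation_loss(student, teacher)
```

In this toy case the first position contributes zero because student and teacher agree exactly, while the other two positions receive a positive penalty. A multi-teacher variant would presumably route each rollout to the teacher for its domain before applying the same per-token computation, but that routing is likewise an assumption here.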