Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-long videos bring ultra-long contexts, redundant information, and prohibitive computational costs. To address these challenges, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that substantially increase throughput and reduce computational overhead. To overcome catastrophic forgetting during multi-task alignment, we further introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and at long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
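
To make the idea of sparse attention over a GQA layout concrete, the following is a minimal, illustrative sketch and not the Keye-VL-2.0 implementation. It assumes that a cheap indexer has already produced one relevance score per key token (the `index_scores` argument is hypothetical), that only the top-k keys are attended to, and that all query heads in a GQA group share the same selected token set. None of these details are confirmed by the abstract, and all function names are invented for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, index_scores, k):
    """Attend only over the k keys with the highest index score.

    index_scores is an assumed input: one relevance score per key,
    produced by some lightweight indexer that is not modeled here.
    Returns the attention output and the selected key positions.
    """
    k = min(k, len(keys))
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(ranked[:k])  # keep the original temporal order
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]
    return out, selected


def gqa_sparse_attention(group_queries, keys, values, index_scores, k):
    """Assumption: query heads in one GQA group share a KV head,
    so they reuse the same top-k selection."""
    return [topk_sparse_attention(q, keys, values, index_scores, k)[0]
            for q in group_queries]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    scores = [0.1, 0.9, 0.5]
    out, selected = topk_sparse_attention([1.0, 0.0], keys, values, scores, k=2)
    print(selected, out)
```

In this sketch, the softmax and the weighted sum for each query touch k tokens instead of the full context, which is where the savings for very long video sequences would come from. The indexer itself still has to score every key, so its cost needs to be small relative to full attention.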