Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-long videos pose three challenges: ultra-long contexts, information redundancy, and prohibitive computational cost. To address them, Keye-VL-2.0 is the first model to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless processing of 256K-token contexts while capturing key frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels, which maximizes throughput and minimizes computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), combined with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration in Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations on video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly in fine-grained temporal localization on TimeLens and in long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
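To make the phrase "dense token-level teacher feedback from on-policy rollouts" concrete, the following is a minimal sketch of one common form of on-policy distillation: a per-token reverse KL divergence, KL(student || teacher), evaluated on sequences sampled from the student, with each rollout routed to a domain-specific teacher. The abstract does not state the exact MOPD objective, so the choice of reverse KL, the routing by a "domain" key, and all function and field names below are illustrative assumptions rather than the report's method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_level_reverse_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher) along one rollout.

    Both arguments are lists of per-position logit vectors over the same
    vocabulary, evaluated on a sequence sampled from the student (on-policy).
    Reverse KL is an assumed objective, not one confirmed by the abstract.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Expectation under the student distribution of log(student / teacher).
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


def multi_teacher_loss(rollouts, teachers):
    """Mean per-token loss, scoring each rollout with its domain's teacher.

    rollouts: list of dicts with hypothetical keys "domain", "tokens",
              and "student_logits".
    teachers: dict mapping a domain name to a callable that returns
              per-position teacher logits for the given tokens.
    """
    total, count = 0.0, 0
    for r in rollouts:
        teacher_logits = teachers[r["domain"]](r["tokens"])
        per_token = token_level_reverse_kl(r["student_logits"], teacher_logits)
        total += sum(per_token)
        count += len(per_token)
    return total / count if count else 0.0
```

Because every position in the rollout contributes its own loss term, the signal is dense, in contrast to a single scalar reward per sequence. In a real training loop the logits would come from model forward passes and the loss would be differentiated with respect to the student parameters; this sketch only shows the arithmetic of the per-token term.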