Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that maximize throughput and minimize computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
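To make the sparse-attention idea concrete, the following is a minimal sketch of generic top-k sparse attention for a single query, written in plain Python. It is not the DSA kernel described in this report: the function name `topk_sparse_attention`, the `index_scores` input (a hypothetical stand-in for whatever cheap relevance scorer selects tokens), and the toy shapes are all assumptions made for illustration.

```python
import math


def topk_sparse_attention(q, keys, values, index_scores, k):
    """Attend from one query to only the k context tokens with the highest index scores.

    q:            list[float], query vector of length d
    keys, values: list[list[float]], one row per context token
    index_scores: list[float], one cheap relevance score per context token
                  (hypothetical; stands in for a learned token selector)
    k:            number of context tokens the query is allowed to attend to
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, len(keys))

    # Pick the k highest-scoring positions, then restore their original order.
    by_score = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(by_score[:k])

    # Scaled dot-product logits over the selected tokens only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in selected]

    # Numerically stable softmax.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of the selected value rows.
    dim = len(values[0])
    out = [sum(w * values[i][j] for w, i in zip(weights, selected)) for j in range(dim)]
    return out, selected


# Toy usage: four context tokens, the query may attend to two of them.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_scores = [0.9, 0.1, 0.8, 0.2]
out, selected = topk_sparse_attention(q, keys, values, index_scores, k=2)
```

In this toy case the query attends only to positions 0 and 2. In a scheme of this kind, the softmax and value aggregation scale with k rather than with the full context length, although the scoring pass still has to touch every token.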