Kwai Keye-VL-2.0 Technical Report

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-level videos bring ultra-long contexts, heavy information redundancy, and prohibitive computational cost. To address these challenges, Keye-VL-2.0 is the first model to adapt DeepSeek Sparse Attention (DSA) to a multimodal architecture based on grouped-query attention (GQA), enabling lossless 256K-context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels, which together maximize throughput and minimize computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale. It performs particularly well in fine-grained temporal localization on TimeLens and in long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
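The abstract does not describe how DSA is adapted to GQA, so the following is only a minimal sketch of the general idea behind top-k sparse attention: a cheap indexer ranks the causally visible positions, and full attention is computed over the selected positions only. It is not the Keye-VL-2.0 or DeepSeek kernel. The function name `topk_sparse_attention`, the `index_q`/`index_k` features, and the `top_k` parameter are hypothetical and chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(queries, keys, values, index_q, index_k, top_k):
    """Causal attention where each query attends to at most top_k keys.

    index_q / index_k are features used only to rank keys. They stand in
    for a learned indexer and are an assumption of this sketch.
    """
    dim = len(queries[0])
    outputs, selected = [], []
    for t, q in enumerate(queries):
        # Rank the causally visible positions 0..t by indexer score.
        scores = [(dot(index_q[t], index_k[s]), s) for s in range(t + 1)]
        keep = sorted(s for _, s in sorted(scores, reverse=True)[:top_k])
        # Scaled dot-product attention restricted to the kept positions.
        logits = [dot(q, keys[s]) / math.sqrt(dim) for s in keep]
        weights = softmax(logits)
        out = [sum(w * values[s][d] for w, s in zip(weights, keep))
               for d in range(len(values[0]))]
        outputs.append(out)
        selected.append(keep)
    return outputs, selected
```

In this sketch the attention step for each query costs on the order of `top_k` rather than the full sequence length, while the indexer still scores every visible position. When `top_k` is at least the sequence length, the result reduces to ordinary dense causal attention.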