Kwai Keye-VL 1.5 Technical Report

Abstract

Recent advances in Large Language Models (LLMs) have extended their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). Video understanding nevertheless remains challenging because videos are dynamic and information-dense, and existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses these fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity: key frames with significant visual changes are processed at higher resolution (Slow pathway), while relatively static frames are processed at lower resolution with increased temporal coverage (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling the processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment, comprising a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. In extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 shows significant improvements over existing models, excelling in particular on video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
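The Slow-Fast idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the decision rule, so everything below is an assumption made for illustration: the function names `frame_similarity` and `assign_pathways`, the pixel-difference similarity, the 0.95 threshold, and the rule of comparing each frame with the most recent Slow frame are hypothetical and are not taken from the report.

```python
import numpy as np


def frame_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [0, 1] for uint8 frames.

    Assumed measure: 1 minus the mean absolute pixel difference,
    normalized by 255. The report may use a different measure.
    """
    diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
    return 1.0 - float(diff.mean()) / 255.0


def assign_pathways(frames, threshold=0.95):
    """Label each frame "slow" or "fast" (hypothetical rule).

    The first frame is always slow. Each later frame is compared with
    the most recent slow frame: if it is at least `threshold` similar
    it is labeled fast, otherwise it becomes the new slow reference.
    """
    labels = []
    last_slow = None
    for frame in frames:
        if last_slow is not None and frame_similarity(frame, last_slow) >= threshold:
            labels.append("fast")
        else:
            labels.append("slow")
            last_slow = frame
    return labels


if __name__ == "__main__":
    dark = np.zeros((4, 4), dtype=np.uint8)
    bright = np.full((4, 4), 255, dtype=np.uint8)
    nearly_bright = bright.copy()
    nearly_bright[0, 0] = 250  # tiny change relative to the bright frame

    print(assign_pathways([dark, dark.copy(), bright, nearly_bright]))
```

In a full pipeline, frames labeled slow would then be encoded at higher resolution and frames labeled fast at lower resolution; that encoding step and any token budgeting are outside the scope of this sketch.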
