Kwai Keye-VL 1.5 Technical Report
September 1, 2025
Authors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have advanced significantly, and
Multimodal Large Language Models (MLLMs) have extended their capabilities to
multimodal tasks. However, video understanding
remains a challenging area due to the dynamic and information-dense nature of
videos. Existing models struggle with the trade-off between spatial resolution
and temporal coverage when processing video content. We present Keye-VL-1.5,
which addresses fundamental challenges in video comprehension through three key
innovations. First, we introduce a novel Slow-Fast video encoding strategy that
dynamically allocates computational resources based on inter-frame similarity,
processing key frames with significant visual changes at higher resolution
(Slow pathway) while handling relatively static frames with increased temporal
coverage at lower resolution (Fast pathway). Second, we implement a progressive
four-stage pre-training methodology that systematically extends the model's
context length from 8K to 128K tokens, enabling processing of longer videos and
more complex visual content. Third, we develop a comprehensive post-training
pipeline focusing on reasoning enhancement and human preference alignment,
incorporating a 5-step chain-of-thought data construction process, iterative
GSPO-based reinforcement learning with progressive prompt hinting for difficult
cases, and alignment training. Through extensive evaluation on public
benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates
significant improvements over existing models, particularly excelling in video
understanding tasks while maintaining competitive performance on general
multimodal benchmarks.
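
To make the Slow-Fast encoding idea concrete, here is a minimal sketch of routing frames by inter-frame similarity. The cosine similarity on raw pixels and the threshold value are illustrative assumptions, not the report's exact metric or algorithm:

```python
import numpy as np

def slow_fast_partition(frames, sim_threshold=0.9):
    """Assign each frame index to the Slow (high-res) or Fast (low-res) pathway.

    A frame whose similarity to the most recent key frame falls below
    `sim_threshold` is treated as a significant visual change and routed to
    the Slow pathway; near-static frames go to the Fast pathway, trading
    resolution for temporal coverage.
    """
    slow, fast = [], []
    prev_key = None
    for idx, frame in enumerate(frames):
        if prev_key is None:
            slow.append(idx)  # always keep the first frame as a key frame
            prev_key = frame
            continue
        a = frame.astype(np.float32).ravel()
        b = prev_key.astype(np.float32).ravel()
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < sim_threshold:   # significant change -> Slow pathway
            slow.append(idx)
            prev_key = frame
        else:                     # relatively static -> Fast pathway
            fast.append(idx)
    return slow, fast
```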
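The four-stage pre-training schedule can be pictured as a staged context-extension config. The abstract only states that the context grows from 8K to 128K tokens over four stages; the intermediate lengths below are illustrative assumptions:

```python
# Hypothetical stage schedule mirroring the reported 8K -> 128K extension.
PRETRAIN_STAGES = [
    {"stage": 1, "max_context_tokens": 8_192},    # base context
    {"stage": 2, "max_context_tokens": 32_768},   # assumed intermediate step
    {"stage": 3, "max_context_tokens": 65_536},   # assumed intermediate step
    {"stage": 4, "max_context_tokens": 131_072},  # long-video context
]

for cfg in PRETRAIN_STAGES:
    print(f"stage {cfg['stage']}: sequences up to {cfg['max_context_tokens']} tokens")
```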
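For the reinforcement-learning stage, the sketch below shows a GSPO-style group-relative objective with a length-normalized sequence-level importance ratio and PPO-style clipping. This is a generic sketch of that family of objectives, not the report's training recipe; the progressive prompt hinting and reward design are omitted:

```python
import torch

def gspo_style_loss(logp_new, logp_old, lengths, rewards, clip_eps=0.2):
    """Group-relative policy objective over G sampled responses to one prompt.

    logp_new, logp_old: (G,) summed log-probs of each response sequence
    lengths: (G,) token counts, used to length-normalize the ratio
    rewards: (G,) scalar rewards for the group
    """
    # group-normalized advantages (illustrative; epsilon avoids divide-by-zero)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # sequence-level, length-normalized importance ratio
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.mean(torch.min(ratio * adv, clipped * adv))
```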