Kwai Keye-VL Technical Report

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model built for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training phase enhances foundational capabilities such as instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
