LLaVA-OneVision-2: Toward Next-Generation Perceptual Intelligence

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, which achieves strong performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while preserving native resolution. Its key advance is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream: bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence, which is packed into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by 44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B outperforms Qwen3-VL-8B by an average of 4.3 points on video tasks, 5.3 points on spatial tasks, and 15.6 J&F points on tracking tasks.
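
The abstract does not specify how bit-cost dynamics are turned into temporal groups or how motion-residual cues select spatial evidence, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a per-frame bit cost and a per-patch motion-residual score are already available; the function names, the cumulative-budget rule, and the parameters `budget_per_group`, `max_len`, and `k` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def adaptive_temporal_groups(
    bit_costs: Sequence[float],
    budget_per_group: float,  # hypothetical: bit cost at which a group is closed
    max_len: int,             # hypothetical: cap on frames per group
) -> List[Tuple[int, int]]:
    """Partition frames into contiguous [start, end) groups.

    A group is closed as soon as its accumulated bit cost reaches
    `budget_per_group` or it contains `max_len` frames, so high-cost
    (event-bearing) stretches yield shorter groups than static ones.
    """
    groups: List[Tuple[int, int]] = []
    start, acc = 0, 0.0
    for i, cost in enumerate(bit_costs):
        acc += cost
        if acc >= budget_per_group or (i - start + 1) >= max_len:
            groups.append((start, i + 1))
            start, acc = i + 1, 0.0
    if start < len(bit_costs):
        # Trailing frames that never hit either limit form the last group.
        groups.append((start, len(bit_costs)))
    return groups


def select_top_patches(residual_scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring patches, in spatial order."""
    ranked = sorted(range(len(residual_scores)), key=lambda j: -residual_scores[j])
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Six near-static frames, a three-frame burst, then three quiet frames.
    costs = [1, 1, 1, 1, 1, 1, 9, 8, 9, 1, 1, 1]
    print(adaptive_temporal_groups(costs, budget_per_group=8.0, max_len=6))
    print(select_top_patches([0.1, 0.9, 0.3, 0.7], k=2))
```

In this toy input, the static stretch collapses into one long group while each frame of the burst gets its own group, which is the sense in which a fixed token budget per group ends up concentrated on event-bearing content. A fixed group-of-pictures scheme would instead cut the stream at constant intervals regardless of content.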