LLaVA-OneVision-2: Toward Next-Generation Perceptual Intelligence

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, in which bit-cost dynamics determine adaptive temporal groups and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
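The codec-stream idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the jump-ratio rule, the maximum group length, and the top-k selection are all assumptions chosen for illustration. The sketch only shows the general shape of the technique, in which per-frame bit costs decide where temporal groups begin and per-patch motion-residual energy decides which patches receive tokens.

```python
from typing import List, Tuple


def adaptive_temporal_groups(
    bit_costs: List[float], ratio: float = 2.0, max_len: int = 8
) -> List[Tuple[int, int]]:
    """Split frames into half-open [start, end) groups.

    Hypothetical rule: a new group starts when a frame's bit cost
    exceeds `ratio` times the previous frame's cost (a likely event
    boundary), or when the current group already holds `max_len` frames.
    """
    if not bit_costs:
        return []
    groups: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(bit_costs)):
        jump = bit_costs[i] > ratio * bit_costs[i - 1]
        full = (i - start) >= max_len
        if jump or full:
            groups.append((start, i))
            start = i
    groups.append((start, len(bit_costs)))
    return groups


def select_salient_patches(
    residual_energy: List[List[float]], budget: int
) -> List[Tuple[int, int]]:
    """Return (row, col) of the `budget` highest-energy patches.

    `residual_energy` is an assumed per-patch motion-residual score.
    The result is in raster order so selected patches can be packed
    into a compact canvas while keeping their relative layout.
    """
    if budget <= 0:
        return []
    scored = [
        (energy, r, c)
        for r, row in enumerate(residual_energy)
        for c, energy in enumerate(row)
    ]
    # sorted() is stable, so ties keep raster order.
    top = sorted(scored, key=lambda t: -t[0])[:budget]
    return sorted((r, c) for _, r, c in top)


# Toy inputs: bit-cost spikes at frames 3 and 7 open new groups.
costs = [10, 11, 9, 40, 12, 11, 10, 35, 9]
groups = adaptive_temporal_groups(costs)
# groups == [(0, 3), (3, 7), (7, 9)]

energy = [[0.1, 0.9, 0.0], [0.5, 0.2, 0.8]]
patches = select_salient_patches(energy, budget=2)
# patches == [(0, 1), (1, 2)]
```

In this toy setup, quiet stretches of video collapse into long groups while bit-cost spikes open new ones, so a fixed token budget is spent where the stream itself indicates change. A real system would presumably read bit costs and residuals from the decoder rather than from hand-written lists.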