LLaVA-OneVision-2: Toward Next-Generation Perceptual Intelligence

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation at native resolution. Its key advance is codec-stream tokenization: compressed video is treated as a continuous bit-cost stream, in which bit-cost dynamics determine adaptive temporal groups and motion-residual cues select salient spatial evidence that is packed into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by 44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B outperforms Qwen3-VL-8B by an average of 4.3 points on video tasks, 5.3 points on spatial tasks, and 15.6 points in average J&F on tracking tasks.
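To make the idea of codec-stream tokenization more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a per-frame bit cost and a per-patch motion-residual score have already been extracted from the compressed stream. The function names (`adaptive_groups`, `select_salient_patches`) and the parameters `budget` and `k` are hypothetical, and the greedy grouping rule is one simple possibility rather than the rule used in the paper.

```python
from typing import List, Sequence, Tuple


def adaptive_groups(bit_costs: Sequence[float], budget: float) -> List[Tuple[int, int]]:
    """Greedily split frames into contiguous groups by accumulated bit cost.

    Hypothetical rule: a group is closed just before a frame that would push
    its accumulated cost above `budget`. Low-cost (static) stretches therefore
    form long groups, while high-cost (eventful) frames form short ones.
    Returns half-open (start, end) frame index ranges.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    groups: List[Tuple[int, int]] = []
    start = 0
    acc = 0.0
    for i, cost in enumerate(bit_costs):
        # Close the current group if it is non-empty and would overflow.
        if i > start and acc + cost > budget:
            groups.append((start, i))
            start, acc = i, 0.0
        acc += cost
    if start < len(bit_costs):
        groups.append((start, len(bit_costs)))
    return groups


def select_salient_patches(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring patches, in spatial order.

    `scores` stands in for a per-patch motion-residual magnitude (assumed).
    """
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy stream: four quiet frames, a two-frame burst, then two quiet frames.
costs = [1, 1, 1, 1, 10, 10, 1, 1]
print(adaptive_groups(costs, budget=6.5))
# [(0, 4), (4, 5), (5, 6), (6, 8)]

print(select_salient_patches([0.1, 0.9, 0.3, 0.7], k=2))
# [1, 3]
```

In this toy stream the quiet frames are merged into longer groups while each high-cost frame gets its own group, which is the qualitative behavior the abstract attributes to bit-cost-driven grouping. A fixed group-of-pictures scheme would instead cut the stream at regular intervals regardless of content.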