LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

May 25, 2026
Authors: Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
cs.AI

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
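
The abstract does not spell out how bit-cost dynamics are turned into adaptive temporal groups. As a rough illustration only, the Python sketch below assumes one simple rule: split the frame sequence into contiguous groups that each carry about the same cumulative bit cost, so that bursts of expensive (event-bearing) frames fall into shorter groups than long static stretches. The function name `group_by_bit_cost`, the per-frame `bit_costs` input, and the equal-cost rule are all assumptions made for this example, not the paper's algorithm.

```python
from typing import List, Tuple


def group_by_bit_cost(bit_costs: List[float], num_groups: int) -> List[Tuple[int, int]]:
    """Split frames into contiguous groups of roughly equal cumulative bit cost.

    Hypothetical helper for illustration; not taken from the paper.
    bit_costs[i] is the encoded size of frame i (e.g. in bytes).
    Returns (start, end) index pairs with end exclusive.
    """
    n = len(bit_costs)
    if n == 0:
        return []
    # Never ask for more groups than frames, and always produce at least one.
    k = max(1, min(num_groups, n))
    total = float(sum(bit_costs))

    groups: List[Tuple[int, int]] = []
    start = 0
    running = 0.0
    for i, cost in enumerate(bit_costs):
        running += cost
        groups_left = k - len(groups)  # counts the group currently being filled
        if groups_left == 1:
            continue  # the final group absorbs all remaining frames
        frames_left = n - (i + 1)
        # Cumulative cost at which the current group should end.
        target = total * (len(groups) + 1) / k
        # Force a cut when every later group needs one of the remaining frames.
        must_close = frames_left == groups_left - 1
        if running >= target or must_close:
            groups.append((start, i + 1))
            start = i + 1
    groups.append((start, n))
    return groups


# Three expensive frames in the middle get a short group of their own.
costs = [2, 2, 2, 2, 20, 20, 20, 2, 2, 2]
print(group_by_bit_cost(costs, 3))  # [(0, 5), (5, 7), (7, 10)]
```

A fixed group-of-pictures scheme would cut these ten frames at regular intervals regardless of content; the cost-driven cuts instead shift boundaries toward the high-bit-cost burst, which is the intuition behind spending a limited token budget on event-bearing content.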