LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

May 25, 2026
Authors: Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
cs.AI

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates windowed attention for efficient local computation while preserving native resolution. Its key advance is codec-stream tokenization: compressed video is treated as a continuous bit-cost stream, in which bit-cost dynamics determine adaptive temporal groups and motion-residual cues select salient spatial evidence that is packed into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We also build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. In addition, we introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by 44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B also outperforms Qwen3-VL-8B by 4.3 points on average on video tasks, by 5.3 points on spatial tasks, and by 15.6 points in average J&F on tracking tasks.
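
To make the idea of bit-cost-driven temporal grouping concrete, here is a minimal sketch. It is not the authors' algorithm, since the abstract does not specify one. It assumes that a per-frame bit cost (for example, the compressed size of each frame) is already available, and it uses a simple equal-cumulative-cost rule chosen only for illustration. The function name `adaptive_temporal_groups` and its parameters are hypothetical.

```python
from typing import List, Tuple


def adaptive_temporal_groups(
    bit_costs: List[float], num_groups: int
) -> List[Tuple[int, int]]:
    """Split frames into contiguous groups of roughly equal cumulative bit cost.

    Illustrative assumption: a frame that costs more bits carries more change,
    so high-cost regions should end up in shorter groups.
    Returns half-open (start, end) frame-index ranges, at most num_groups of them.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    if not bit_costs:
        return []
    total = float(sum(bit_costs))
    if total <= 0:
        raise ValueError("bit costs must sum to a positive value")

    target = total / num_groups  # bit budget per group
    groups: List[Tuple[int, int]] = []
    start = 0
    running = 0.0
    for i, cost in enumerate(bit_costs):
        running += cost
        # Close the current group once the cumulative cost reaches the next
        # multiple of the per-group budget; the last group takes the remainder.
        if len(groups) < num_groups - 1 and running >= target * (len(groups) + 1):
            groups.append((start, i + 1))
            start = i + 1
    if start < len(bit_costs):
        groups.append((start, len(bit_costs)))
    return groups


# Twelve frames with a burst of activity in frames 4-7.
costs = [1, 1, 1, 1, 4, 4, 4, 4, 1, 1, 1, 1]
print(adaptive_temporal_groups(costs, 3))  # [(0, 5), (5, 7), (7, 12)]
```

In this toy input each group carries the same total cost (8), but the group inside the burst spans two frames while the groups covering quieter regions span five. A fixed group-of-pictures split would instead cut every four frames regardless of content, which is the contrast the abstract draws.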