LLaVA-OneVision-2: Toward Next-Generation Perceptual Intelligence

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, with strong performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and adds windowed attention for efficient local computation while preserving native resolution. Its key advance is codec-stream tokenization: compressed video is treated as a continuous bit-cost stream, in which bit-cost dynamics determine adaptive temporal groups and motion-residual cues select salient spatial evidence that is packed into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, yielding more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We also build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. In addition, we introduce JumpScore, a temporal-localization benchmark that targets fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by 44.8 points; on the same benchmark under matched visual-token budgets, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B outperforms Qwen3-VL-8B by 4.3 points on average on video tasks, by 5.3 points on spatial tasks, and by 15.6 points in average J&F on tracking tasks.
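
The abstract does not spell out the grouping or selection rules, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes per-frame bit costs and per-patch motion-residual energies have already been extracted from the compressed stream. The function names, the greedy equal-cost rule, and the top-k rule are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def adaptive_temporal_groups(bit_costs: List[int],
                             bits_per_group: int) -> List[Tuple[int, int]]:
    """Greedily split frames into contiguous [start, end) groups.

    Hypothetical rule: a group is closed as soon as its accumulated
    bit cost reaches bits_per_group, so costly (event-dense) stretches
    produce short groups and cheap (static) stretches produce long ones.
    """
    if bits_per_group <= 0:
        raise ValueError("bits_per_group must be positive")
    groups: List[Tuple[int, int]] = []
    start = 0
    accumulated = 0
    for index, cost in enumerate(bit_costs):
        accumulated += cost
        if accumulated >= bits_per_group:
            groups.append((start, index + 1))
            start = index + 1
            accumulated = 0
    # Flush any trailing frames that never reached the threshold.
    if start < len(bit_costs):
        groups.append((start, len(bit_costs)))
    return groups


def top_k_patches(residual_energy: List[float], k: int) -> List[int]:
    """Return indices of the k highest-energy patches, in spatial order.

    Hypothetical stand-in for selecting salient spatial evidence
    from motion-residual cues.
    """
    ranked = sorted(range(len(residual_energy)),
                    key=lambda i: residual_energy[i],
                    reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Four static frames, three event-heavy frames, three static frames.
    costs = [10, 10, 10, 10, 90, 80, 70, 10, 10, 10]
    print(adaptive_temporal_groups(costs, bits_per_group=40))
    print(top_k_patches([0.1, 2.0, 0.3, 1.5], k=2))
```

In this toy input the three high-cost frames each form their own group while the static frames are merged, which is the behavior a fixed group-of-pictures length cannot provide. A real system would presumably also cap group length and tie the per-group patch count to the overall token budget; those details are not given in the abstract.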