OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
February 9, 2026
Authors: Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
cs.AI
Abstract
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. Yet modern vision architectures have strayed from these fundamental principles: visual signals are highly redundant, while the discriminative information, the "surprise," is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static backgrounds rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., with codecs.
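To make the sparsity claim concrete, here is a minimal sketch (ours, not the paper's code) that measures how much inter-frame residual energy concentrates in a small fraction of patches. The 16-pixel patch size, the 90% energy cutoff, and the synthetic two-frame example are assumptions chosen purely for illustration.

```python
# Illustrative only: shows that most inter-frame "surprise" lives in few patches.
import numpy as np

def residual_energy_per_patch(prev_frame: np.ndarray,
                              frame: np.ndarray,
                              patch: int = 16) -> np.ndarray:
    """Sum of squared inter-frame differences inside each patch."""
    diff = (frame.astype(np.float32) - prev_frame.astype(np.float32)) ** 2
    h, w = diff.shape[:2]
    gh, gw = h // patch, w // patch
    # Collapse each patch (and any channel dim) to a single energy value.
    energy = diff[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, -1)
    return energy.sum(axis=(1, 3, 4))  # shape: (gh, gw)

def fraction_of_patches_for_energy(energy: np.ndarray, share: float = 0.9) -> float:
    """Smallest fraction of patches that accounts for `share` of total energy."""
    flat = np.sort(energy.ravel())[::-1]                 # descending energies
    cum = np.cumsum(flat) / max(flat.sum(), 1e-12)
    k = int(np.searchsorted(cum, share)) + 1
    return k / flat.size

# Usage on two synthetic frames: a static background with one small moving block.
prev = np.zeros((224, 224), dtype=np.uint8)
curr = prev.copy()
curr[96:128, 96:128] = 255  # the only "surprising" region
e = residual_energy_per_patch(prev, curr)
print(f"{fraction_of_patches_for_energy(e):.1%} of patches hold 90% of the residual energy")
```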
Method. OneVision-Encoder (OV-Encoder) encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, it employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective spanning more than one million semantic concepts, jointly capturing object permanence and motion dynamics.
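The sketch below is a hedged illustration of how codec-aligned patch selection with 3D positions could look in practice: keep only the highest-residual patches of each frame and carry their (t, y, x) indices so a shared 3D rotary position embedding can still locate them. The keep ratio, patch size, per-frame top-k selection, and the `select_codec_patches` helper are our assumptions, not the released OneVision-Encoder implementation.

```python
import torch

def select_codec_patches(video: torch.Tensor,
                         patch: int = 16,
                         keep_ratio: float = 0.25):
    """
    video: (T, C, H, W) float tensor.
    Returns kept raw patches and their (t, y, x) grid coordinates,
    flattened into one irregular token sequence.
    """
    T, C, H, W = video.shape
    gh, gw = H // patch, W // patch

    # Per-patch residual energy vs. the previous frame.
    # Frame 0 compares to itself; a real codec would keep it dense as an I-frame.
    prev = torch.cat([video[:1], video[:-1]], dim=0)
    diff = (video - prev).pow(2)
    energy = diff.reshape(T, C, gh, patch, gw, patch).sum(dim=(1, 3, 5))  # (T, gh, gw)

    k = max(1, int(keep_ratio * gh * gw))
    tokens, coords = [], []
    for t in range(T):
        idx = energy[t].flatten().topk(k).indices            # top-k patches in frame t
        ys, xs = idx // gw, idx % gw
        patches = video[t].unfold(1, patch, patch).unfold(2, patch, patch)  # (C, gh, gw, p, p)
        tokens.append(patches[:, ys, xs].permute(1, 0, 2, 3).reshape(k, -1))
        coords.append(torch.stack([torch.full_like(ys, t), ys, xs], dim=-1))
    return torch.cat(tokens), torch.cat(coords)  # (N, C*p*p) patches, (N, 3) positions for 3D RoPE

# Usage: a 4-frame clip keeps 25% of its patches per frame.
clip = torch.randn(4, 3, 224, 224)
toks, pos = select_codec_patches(clip)
print(toks.shape, pos.shape)  # torch.Size([196, 768]) torch.Size([196, 3])
```

In a full encoder, the kept patches would be linearly embedded and the (t, y, x) coordinates fed to the shared 3D RoPE, so attention operates over the irregular token set without assuming a dense grid.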
Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, it achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, establishing OV-Encoder as a scalable engine for next-generation visual generalists.