OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
February 9, 2026
Authors: Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
cs.AI
Abstract
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the underlying structure of the data. These are foundational principles. Yet modern vision architectures have strayed from them: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static backgrounds rather than on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, that is, with codecs.
Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation and focuses exclusively on the 3.1%–25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OV-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics.
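To make the codec-aligned selection concrete, the following is a minimal, illustrative sketch, not the released OV-Encoder code: it uses inter-frame residual energy as a cheap proxy for signal entropy, keeps only the highest-energy fraction of patches, and returns their (t, y, x) coordinates so a shared 3D RoPE can still position the resulting irregular token layout. The function and parameter names (`select_codec_patches`, `keep_ratio`) are assumptions introduced for illustration.

```python
# Illustrative sketch of codec-aligned patch selection (not the official code).
# Assumption: inter-frame residual energy approximates "signal entropy".
import torch

def select_codec_patches(video, patch=16, keep_ratio=0.1):
    """video: (T, C, H, W) float tensor.
    Returns (tokens, positions): kept patch vectors and their (t, y, x)
    indices, which a shared 3D RoPE can consume downstream."""
    T, C, H, W = video.shape

    # Inter-frame residuals play the role of a codec's prediction error;
    # frame 0 is kept whole as an intra ("key") frame.
    residual = video.clone()
    residual[1:] = video[1:] - video[:-1]

    # Fold frames into non-overlapping patches: (T, H/p, W/p, C*p*p).
    patches = residual.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(
        T, H // patch, W // patch, C * patch * patch)

    # Rank patches by residual energy and keep only the top fraction,
    # mimicking the sparse 3.1%-25% token budget described above.
    energy = patches.pow(2).mean(dim=-1)          # (T, Hp, Wp)
    flat = energy.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    keep = flat.topk(k).indices

    # Recover (t, y, x) coordinates so irregular tokens retain 3D positions.
    Hp, Wp = H // patch, W // patch
    t = keep // (Hp * Wp)
    y = (keep % (Hp * Wp)) // Wp
    x = keep % Wp

    tokens = patches.reshape(-1, patches.shape[-1])[keep]
    return tokens, torch.stack([t, y, x], dim=-1)

# Example usage on a dummy clip of 8 frames at 224x224:
# video = torch.randn(8, 3, 224, 224)
# tokens, pos = select_codec_patches(video, keep_ratio=0.1)
```

In a full encoder, such a selection step would precede patch embedding, and the returned coordinates would parameterize the shared 3D rotary position encoding, so spatial and temporal structure survives even though the surviving tokens no longer form a regular grid.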
Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle that makes OV-Encoder a scalable engine for next-generation visual generalists.