From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

October 16, 2025
Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
cs.AI

Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
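
To make principles (i)-(iii) concrete, the sketch below illustrates, in PyTorch, what a "native" vision-language primitive can look like: image patches and word tokens are projected into one shared embedding space and processed by a single unified transformer block, instead of by separate vision and language towers. This is a minimal illustrative sketch under assumed names (`NativeVLMPrimitive`, `patch_embed`, and the chosen dimensions are all hypothetical); it is not the actual NEO architecture, whose details are in the linked repository.

```python
# Minimal, hypothetical sketch of a "native" vision-language primitive.
# Not the NEO architecture -- just an illustration of principles (i)-(iii):
# pixels and words share one semantic space and one dense processing block.
import torch
import torch.nn as nn

class NativeVLMPrimitive(nn.Module):
    """One monolithic block that encodes, aligns, and reasons over
    image patches and word tokens in a shared semantic space."""

    def __init__(self, d_model=768, n_heads=12, patch=16, vocab=32000):
        super().__init__()
        # Pixels: non-overlapping patches projected into the shared space.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Words: token ids embedded into the same space.
        self.tok_embed = nn.Embedding(vocab, d_model)
        # A single unified transformer layer serves both modalities,
        # rather than separate vision and language modules.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, image, token_ids):
        # (B, 3, H, W) -> (B, num_patches, d_model)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)
        txt = self.tok_embed(token_ids)          # (B, seq_len, d_model)
        fused = torch.cat([vis, txt], dim=1)     # one joint sequence
        return self.block(fused)                 # joint encoding and alignment

# Usage: a 224x224 image and a short caption flow through one dense block.
x = torch.randn(2, 3, 224, 224)
ids = torch.randint(0, 32000, (2, 16))
out = NativeVLMPrimitive()(x, ids)
print(out.shape)  # torch.Size([2, 212, 768]): 196 patches + 16 tokens
```

Stacking such blocks yields the kind of dense, monolithic model the abstract describes, where cross-modal alignment is a property of every layer rather than of a separate connector module.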