
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

October 16, 2025
Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
cs.AI

Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
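Principle (i) above, aligning pixel and word representations in a shared semantic space, can be illustrated with a minimal contrastive-alignment sketch. Note this is a generic illustration of the idea (shared projection space plus a symmetric InfoNCE-style objective, as popularized by CLIP-like models), not NEO's actual architecture; all dimensions, the temperature value, and the pooling assumption are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project features, then L2-normalize onto the shared space."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy batch: 4 images (patch features assumed pooled to one vector each)
# paired with 4 captions (token features likewise pooled).
d_img, d_txt, d_shared = 64, 48, 32
img_feats = rng.normal(size=(4, d_img))
txt_feats = rng.normal(size=(4, d_txt))

# Modality-specific projections into the shared semantic space.
W_img = rng.normal(size=(d_img, d_shared)) * 0.1
W_txt = rng.normal(size=(d_txt, d_shared)) * 0.1

z_img = project(img_feats, W_img)
z_txt = project(txt_feats, W_txt)

# Cosine-similarity logits; matched image-text pairs lie on the diagonal.
logits = z_img @ z_txt.T / 0.07  # temperature 0.07 is illustrative

def cross_entropy(logits, targets):
    """Mean cross-entropy with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Symmetric loss: image-to-text and text-to-image matching directions.
targets = np.arange(4)
loss = 0.5 * (cross_entropy(logits, targets)
              + cross_entropy(logits.T, targets))
print(f"alignment loss: {loss:.3f}")
```

Minimizing such a loss pulls each image embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is one common way to ground pixels and words in a single space before joint encoding and reasoning.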