From Pixels to Words: Towards Native One-Vision Models at Scale

Abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders through multi-stage alignment. This modular design inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. Meanwhile, native VLMs perform well on single images but remain largely unexplored for multi-image understanding, video understanding, and spatial intelligence. We therefore introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondences end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov allows fine-grained, unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov substantially narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but also competitive at scale. Beyond empirical performance, we present systematic architectural analyses and detailed training recipes to facilitate future work on native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.