From Pixels to Words: Towards Native One-Vision Models at Scale

Abstract

Current vision-language models (VLMs) typically adopt a modular framework that couples a separate image encoder with a language decoder through multi-stage alignment. This design inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. Meanwhile, native VLMs, despite impressive single-image performance, remain largely unexplored for multi-image understanding, video understanding, and spatial intelligence. We therefore introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondences end-to-end, without external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov lets fine-grained, unified spatiotemporal modeling emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but also competitive at scale. Beyond empirical performance, we provide systematic architectural analyses and detailed training recipes to facilitate subsequent work on native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.