From Pixels to Words: Towards Native One-Vision Models at Scale

Abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment. This modular framework inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. Native VLMs, meanwhile, perform impressively on single images but remain largely unexplored for multi-image understanding, video understanding, and spatial intelligence. We therefore introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov lets fine-grained, unified spatiotemporal modeling emerge natively inside the model. Notably, NEO-ov substantially narrows the gap with its modular counterparts while excelling at fine-grained visual perception, demonstrating that native "one-vision" architectures are not only feasible but also competitive at scale. Beyond empirical performance, we release systematic architectural analyses and detailed training recipes to facilitate subsequent work on native multimodal modeling. Our code and models are publicly available at https://github.com/EvolvingLMMs-Lab/NEO.