Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
September 30, 2025
Authors: Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
cs.AI
Abstract
Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data and, in some cases, allow visual tasks to be performed without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with distinct scaling trends and origins. We show that an LLM's latent visual reasoning ability is developed predominantly by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and to visual instruction-tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it at the 1T-token pre-training scale. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours and span the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
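To make the idea of a "data-centric recipe" concrete, below is a minimal Python sketch of a category-weighted sampling mixture. The category names and weights are illustrative assumptions only, not the ratios reported in the paper; they merely encode the qualitative findings above: reasoning-centric data (code, math, academic text) weighted heavily, a broad general corpus retained for the diffuse perception prior, and a capped share of visual-description text, whose benefit saturates quickly.

```python
# Hypothetical sketch only: the actual corpus names and mixture ratios used in
# the paper are not given in this abstract. The numbers below are illustrative.
import random

# Assumed corpus categories and illustrative sampling weights (sum to 1.0).
DATA_MIXTURE = {
    "code": 0.25,                 # reasoning-centric
    "math": 0.15,                 # reasoning-centric
    "academic_papers": 0.15,      # reasoning-centric
    "web_general": 0.35,          # broad corpora -> diffuse perception prior
    "visual_descriptions": 0.10,  # useful, but impact saturates rapidly
}


def sample_category(rng: random.Random, mixture: dict) -> str:
    """Sample one pre-training document category according to the mixture weights."""
    categories, weights = zip(*mixture.items())
    return rng.choices(categories, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in DATA_MIXTURE}
    for _ in range(10_000):
        counts[sample_category(rng, DATA_MIXTURE)] += 1
    print(counts)  # empirical category counts should roughly track the weights
```

In practice, such a mixture would feed a document sampler during LLM pre-training; the point of the sketch is only how the abstract's qualitative conclusions could be expressed as an explicit, tunable data mix.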