Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
September 30, 2025
Authors: Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
cs.AI
Abstract
Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data and, in some cases, allow visual tasks to be performed without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with distinct scaling trends and origins. We show that an LLM's latent visual reasoning ability is developed predominantly by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and to visual instruction-tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it at the 1T-token pre-training scale. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours and span the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
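To make the idea of a "data-centric recipe" concrete, below is a minimal Python sketch of a category-weighted sampling mixture. The category names and weights are illustrative assumptions only, not the ratios reported in the paper; they merely encode the qualitative findings above: reasoning-centric data (code, math, academic text) weighted heavily, a broad general corpus retained for the diffuse perception prior, and a capped share of visual-description text, whose benefit saturates quickly.

```python
# Hypothetical sketch only: the actual corpus names and mixture ratios used in
# the paper are not given in this abstract. The numbers below are illustrative.
import random

# Assumed corpus categories and illustrative sampling weights (sum to 1.0).
DATA_MIXTURE = {
    "code": 0.25,                 # reasoning-centric
    "math": 0.15,                 # reasoning-centric
    "academic_papers": 0.15,      # reasoning-centric
    "web_general": 0.35,          # broad corpora -> diffuse perception prior
    "visual_descriptions": 0.10,  # useful, but impact saturates rapidly
}


def sample_category(rng: random.Random, mixture: dict) -> str:
    """Sample one pre-training document category according to the mixture weights."""
    categories, weights = zip(*mixture.items())
    return rng.choices(categories, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in DATA_MIXTURE}
    for _ in range(10_000):
        counts[sample_category(rng, DATA_MIXTURE)] += 1
    print(counts)  # empirical category counts should roughly track the weights
```

In practice, such a mixture would feed a document sampler during LLM pre-training; the point of the sketch is only how the abstract's qualitative conclusions could be expressed as an explicit, tunable data mix.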