A Vision Check-up for Language Models
January 3, 2024
作者: Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
cs.AI
Abstract
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
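
To make the code-as-image idea concrete, the sketch below shows one way an LLM that cannot emit pixels can still "draw": it outputs a short plotting program as text, and that program is rendered offline into an image. This is a minimal illustrative sketch, not the paper's actual pipeline; the function name `render_code_to_image` and the sample snippet are assumptions introduced here for illustration.

```python
# Minimal sketch (illustrative, not from the paper): representing an image as
# code that an LLM can write, then rendering that code into pixels.

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def render_code_to_image(program: str, size_px: int = 256) -> np.ndarray:
    """Execute a small plotting program and return the rendered RGB image."""
    fig, ax = plt.subplots(figsize=(size_px / 100, size_px / 100), dpi=100)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    # The LLM's "image" is just text: a snippet that draws onto `ax`.
    exec(program, {"plt": plt, "ax": ax, "np": np})
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    plt.close(fig)
    return rgba[..., :3]  # drop the alpha channel


# Hypothetical LLM output for the concept "a red circle above a blue square".
llm_program = """
ax.add_patch(plt.Circle((0.5, 0.7), 0.15, color='red'))
ax.add_patch(plt.Rectangle((0.35, 0.2), 0.3, 0.3, color='blue'))
"""

image = render_code_to_image(llm_program)
print(image.shape)  # e.g. (256, 256, 3)
```

Rendered outputs like this can then be scored for how well they depict the prompted concept, fed back to the model for correction, or collected as a synthetic dataset for the self-supervised representation learning experiments the abstract describes.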