A Vision Check-up for Language Models
January 3, 2024
作者: Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
cs.AI
Abstract
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
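
To make the code-as-image idea concrete, the sketch below shows one way an LLM that cannot emit pixels can still "draw": it outputs a short plotting program as text, and that program is rendered offline into an image. This is a minimal illustrative sketch, not the paper's actual pipeline; the function name `render_code_to_image` and the sample snippet are assumptions introduced here for illustration.

```python
# Minimal sketch (illustrative, not from the paper): representing an image as
# code that an LLM can write, then rendering that code into pixels.

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def render_code_to_image(program: str, size_px: int = 256) -> np.ndarray:
    """Execute a small plotting program and return the rendered RGB image."""
    fig, ax = plt.subplots(figsize=(size_px / 100, size_px / 100), dpi=100)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    # The LLM's "image" is just text: a snippet that draws onto `ax`.
    exec(program, {"plt": plt, "ax": ax, "np": np})
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    plt.close(fig)
    return rgba[..., :3]  # drop the alpha channel


# Hypothetical LLM output for the concept "a red circle above a blue square".
llm_program = """
ax.add_patch(plt.Circle((0.5, 0.7), 0.15, color='red'))
ax.add_patch(plt.Rectangle((0.35, 0.2), 0.3, 0.3, color='blue'))
"""

image = render_code_to_image(llm_program)
print(image.shape)  # e.g. (256, 256, 3)
```

Rendered outputs like this can then be scored for how well they depict the prompted concept, fed back to the model for correction, or collected as a synthetic dataset for the self-supervised representation learning experiments the abstract describes.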