A Vision Check-up for Language Models
January 3, 2024
Authors: Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
cs.AI
Abstract
What does learning to model relationships between strings teach large
language models (LLMs) about the visual world? We systematically evaluate LLMs'
abilities to generate and recognize an assortment of visual concepts of
increasing complexity and then demonstrate how a preliminary visual
representation learning system can be trained using models of text. As language
models lack the ability to consume or output visual information as pixels, we
use code to represent images in our study. Although LLM-generated images do not
look like natural images, results on image generation and the ability of models
to correct these generated images indicate that precise modeling of strings can
teach language models about numerous aspects of the visual world. Furthermore,
experiments on self-supervised visual representation learning, utilizing images
generated with text models, highlight the potential to train vision models
capable of making semantic assessments of natural images using just LLMs.
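The abstract's key methodological point is that, since an LLM can neither consume nor emit pixels, images are represented as code: the model writes a program, and running that program produces the picture. The sketch below is an illustrative stand-in for such LLM-emitted code (the grid size, shape, and function name are my own choices, not taken from the paper); it rasterizes a simple visual concept, a filled circle, into a binary pixel grid using only the standard library.

```python
# Illustrative sketch of "code as image": a program an LLM might emit
# to render a visual concept into pixels. Details are hypothetical.

def render_circle(size=16, cx=8, cy=8, r=5):
    """Return a size x size binary image containing a filled circle."""
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            # A pixel is "on" if it lies within distance r of the center.
            row.append(1 if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0)
        image.append(row)
    return image

if __name__ == "__main__":
    # Print the rasterized image as ASCII art.
    for row in render_circle():
        print("".join("#" if p else "." for p in row))
```

Evaluating how well a text-only model can write, and then correct, programs like this one is the kind of probe the paper uses to measure what string modeling teaches an LLM about the visual world.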