A Vision Check-up for Language Models
January 3, 2024
Authors: Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
cs.AI
Abstract
What does learning to model relationships between strings teach large
language models (LLMs) about the visual world? We systematically evaluate LLMs'
abilities to generate and recognize an assortment of visual concepts of
increasing complexity and then demonstrate how a preliminary visual
representation learning system can be trained using models of text. As language
models lack the ability to consume or output visual information as pixels, we
use code to represent images in our study. Although LLM-generated images do not
look like natural images, results on image generation and the ability of models
to correct these generated images indicate that precise modeling of strings can
teach language models about numerous aspects of the visual world. Furthermore,
experiments on self-supervised visual representation learning, utilizing images
generated with text models, highlight the potential to train vision models
capable of making semantic assessments of natural images using just LLMs.
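The abstract's key methodological point is that, since an LLM can neither consume nor emit pixels, images are represented as code: the model writes a program, and running that program produces the picture. The sketch below is an illustrative stand-in for such LLM-emitted code (the grid size, shape, and function name are my own choices, not taken from the paper); it rasterizes a simple visual concept, a filled circle, into a binary pixel grid using only the standard library.

```python
# Illustrative sketch of "code as image": a program an LLM might emit
# to render a visual concept into pixels. Details are hypothetical.

def render_circle(size=16, cx=8, cy=8, r=5):
    """Return a size x size binary image containing a filled circle."""
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            # A pixel is "on" if it lies within distance r of the center.
            row.append(1 if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0)
        image.append(row)
    return image

if __name__ == "__main__":
    # Print the rasterized image as ASCII art.
    for row in render_circle():
        print("".join("#" if p else "." for p in row))
```

Evaluating how well a text-only model can write, and then correct, programs like this one is the kind of probe the paper uses to measure what string modeling teaches an LLM about the visual world.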