A Vision Check-up for Language Models

Abstract

What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
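To make the idea of representing an image as code concrete, here is a minimal sketch. It is an illustration only and does not reproduce the programming languages or rendering pipeline used in the study. The drawing program `IMAGE_AS_CODE`, the `draw` function it defines, and the `render` helper are all hypothetical names introduced for this example.

```python
# Hypothetical illustration: the "image" is stored as program text, not pixels.
# A text-only model could read or write this string; pixels appear only
# after the program is executed by a renderer.
IMAGE_AS_CODE = """
def draw(canvas, size):
    # Filled circle centered in the canvas.
    cx = cy = size // 2
    r = size // 4
    for y in range(size):
        for x in range(size):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                canvas[y][x] = 1
"""


def render(code: str, size: int = 16):
    """Execute drawing code and return a size x size binary pixel grid."""
    namespace = {}
    exec(code, namespace)  # assumption: the code is trusted; never exec untrusted text
    canvas = [[0] * size for _ in range(size)]
    namespace["draw"](canvas, size)
    return canvas


# Render the program and show the pixel grid as ASCII art.
canvas = render(IMAGE_AS_CODE)
for row in canvas:
    print("".join("#" if px else "." for px in row))
```

In this sketch, editing the string (for example, changing the radius) changes the rendered image, which is the sense in which a model that only manipulates text can still generate or correct a picture.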
