A Vision Check-up for Language Models

Abstract

What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
