Large Language Models Implicitly Learn to See and Hear Just By Reading

Abstract

This paper presents a fascinating finding: when an auto-regressive LLM is trained on text tokens, the text model inherently develops an internal ability to understand images and audio, and so learns to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to produce text output conditioned on image and audio embeddings. Our architecture, in contrast, takes patches of images, audio waveforms, or tokens as input and produces the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification on the FSD-50K and GTZAN datasets. Further, we show that this works for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This reinforces the notion that text LLMs learn powerful internal circuits that can be reused by activating the necessary connections for various applications, rather than training models from scratch every time.
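The abstract does not specify the architecture in detail, so the following is only a minimal sketch of the general idea it describes: image patches are projected into a token space, passed through a frozen backbone, and pooled into an embedding and category logits. Everything here is an assumption made for illustration. The names `patchify`, `frozen_backbone`, and `classify`, the patch size, the model width, and the single non-causal attention layer are hypothetical, and the random weights merely stand in for pretrained text-LLM weights.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)  # (num_patches, patch_dim)


def frozen_backbone(tokens, w_qkv, w_out):
    """Hypothetical stand-in for frozen text-LLM weights: one single-head
    self-attention layer with a residual connection (no causal mask,
    for brevity)."""
    d = tokens.shape[-1]
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return tokens + (attn @ v) @ w_out


def classify(image, params, patch=4):
    """Return (embedding, logits) for a single image."""
    tokens = patchify(image, patch) @ params["w_in"]  # input projection
    hidden = frozen_backbone(tokens, params["w_qkv"], params["w_out"])
    embedding = hidden.mean(axis=0)                   # mean-pool over patches
    logits = embedding @ params["w_head"]             # classification head
    return embedding, logits


rng = np.random.default_rng(0)
d_model, n_classes, patch = 64, 10, 4  # assumed sizes, not from the paper
params = {
    "w_in": rng.normal(0, 0.02, (patch * patch * 3, d_model)),
    # Random placeholders; in the setting described above these would be
    # weights taken from a text-trained LLM and kept frozen.
    "w_qkv": rng.normal(0, 0.02, (d_model, 3 * d_model)),
    "w_out": rng.normal(0, 0.02, (d_model, d_model)),
    "w_head": rng.normal(0, 0.02, (d_model, n_classes)),
}

image = rng.random((32, 32, 3))  # CIFAR-10-sized dummy input
embedding, logits = classify(image, params)
print(embedding.shape, logits.shape)
```

In a setup like this, only the input projection and the classification head would be trained, while the backbone weights stay fixed. Which connections are actually trained or activated in the paper is not stated in the abstract.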
