Large Language Models Implicitly Learn to See and Hear Just By Reading

Abstract

This paper presents a fascinating finding: by training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio, thereby learning to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to produce text output conditioned on image and audio embeddings. In contrast, our architecture takes patches of images, audio waveforms, or tokens as input and produces the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification on the FSD-50K and GTZAN datasets. Further, we show that this works for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This advances the notion that text LLMs learn powerful internal circuits that can be utilized by activating the necessary connections for various applications, rather than training models from scratch every single time.
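To make the input side of such a pipeline concrete, the following is a minimal sketch of how an image or a waveform might be turned into a sequence of vectors before being passed to a model. It is an illustration under stated assumptions, not the architecture used in the paper: the helper names `image_to_patches` and `waveform_to_frames`, the non-overlapping 4x4 patches, and the 256-sample frames are all hypothetical choices, and the abstract does not specify how patches or waveforms are actually prepared.

```python
from typing import List

Vector = List[float]


def image_to_patches(image: List[List[float]], patch: int) -> List[Vector]:
    """Split a 2-D grayscale image into flattened, non-overlapping patches.

    Patches are emitted in row-major order. The patch size is an
    illustrative assumption, not a value taken from the paper.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch block into a single vector.
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def waveform_to_frames(waveform: List[float], frame: int) -> List[Vector]:
    """Split a 1-D waveform into consecutive fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption made here for simplicity).
    """
    usable = len(waveform) - len(waveform) % frame
    return [waveform[i:i + frame] for i in range(0, usable, frame)]


# A synthetic 28x28 image (the Fashion-MNIST resolution) cut into 4x4 patches.
image = [[float(r * 28 + c) for c in range(28)] for r in range(28)]
patches = image_to_patches(image, patch=4)
print(len(patches), len(patches[0]))

# A synthetic 1000-sample waveform cut into 256-sample frames.
frames = waveform_to_frames([0.0] * 1000, frame=256)
print(len(frames), len(frames[0]))
```

Each resulting vector would play the role that a token embedding plays for text. How those vectors are projected to the model width, and which weights are trained, is not stated in the abstract and is left open here.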
