Large Language Models Implicitly Learn to See and Hear Just By Reading
May 20, 2025
Authors: Prateek Verma, Mert Pilanci
cs.AI
Abstract
This paper presents a fascinating finding: by training an auto-regressive LLM
on text tokens, the text model inherently develops an internal ability to
understand images and audio, thereby learning to see and hear just by reading.
Popular audio and visual LLMs fine-tune a text LLM to produce text output
conditioned on image and audio embeddings. Our architecture, by contrast,
takes patches of images, audio waveforms, or tokens as input and outputs the
embeddings or category labels typical of a classification pipeline. We show
the generality of the text weights in aiding audio classification on the
FSD-50K and GTZAN datasets. Further, we show this working for image
classification on CIFAR-10 and Fashion-MNIST, as well as on image patches.
This supports the notion that text LLMs learn powerful internal circuits that
can be reused for various applications by activating the necessary
connections, rather than training models from scratch every single time.
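The setup described in the abstract can be illustrated with a minimal sketch
(not the authors' code): a frozen, text-pretrained backbone receives flattened
image patches through a trainable linear projection, and a trainable linear
head reads out class logits. The choice of GPT-2 as the stand-in text LLM, the
4x4 patch size, mean pooling, and the decision to train only the projection and
head are all illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: probe whether frozen text-LLM weights help image classification.
# Assumptions: GPT-2 stands in for the text LLM, CIFAR-10 images are split into
# 4x4 patches, and only the patch projection and classification head are trained.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenTextLLMClassifier(nn.Module):
    def __init__(self, num_classes=10, patch_size=4, in_channels=3):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        self.backbone.requires_grad_(False)           # keep text weights frozen
        hidden = self.backbone.config.hidden_size     # 768 for GPT-2 small
        patch_dim = in_channels * patch_size * patch_size
        self.patch_size = patch_size
        self.to_embed = nn.Linear(patch_dim, hidden)  # trainable input projection
        self.head = nn.Linear(hidden, num_classes)    # trainable classifier head

    def forward(self, images):
        # images: (batch, channels, height, width) -> sequence of flattened patches
        b, c, h, w = images.shape
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)        # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.to_embed(patches)                         # (b, seq, hidden)
        hidden_states = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(hidden_states.mean(dim=1))             # pooled class logits


if __name__ == "__main__":
    model = FrozenTextLLMClassifier()
    logits = model(torch.randn(2, 3, 32, 32))  # two fake CIFAR-10-sized images
    print(logits.shape)                        # torch.Size([2, 10])
```

The same frame extends to audio by replacing the patch projection with a
projection from waveform frames or audio tokens, leaving the frozen text
weights untouched.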