Large Language Models Implicitly Learn to See and Hear Just By Reading
May 20, 2025
Authors: Prateek Verma, Mert Pilanci
cs.AI
Abstract
This paper presents a fascinating finding: by training an autoregressive LLM on
text tokens, the text model internally develops the ability to understand images
and audio, thereby learning to see and hear just by reading. Popular audio and
visual LLMs fine-tune text LLMs to produce text output conditioned on image and
audio embeddings. In contrast, our architecture takes image patches, audio
waveforms, or tokens as input and outputs the embeddings or category labels
typical of a classification pipeline. We show the generality of text weights in
aiding audio classification on the FSD-50K and GTZAN datasets. Further, we show
that this works for image classification on CIFAR-10 and Fashion-MNIST, as well
as on image patches. This extends the notion that text LLMs learn powerful
internal circuits that can be utilized for various applications by activating
the necessary connections, rather than training models from scratch every
single time.