Multimodal Neurons in Pretrained Text-Only Transformers
August 3, 2023
Authors: Sarah Schwettmann, Neil Chowdhury, Antonio Torralba
cs.AI
Abstract
Language models demonstrate remarkable capacity to generalize representations
learned in one modality to downstream tasks in other modalities. Can we trace
this ability to individual neurons? We study the case where a frozen text
transformer is augmented with vision using a self-supervised visual encoder and
a single linear projection learned on an image-to-text task. Outputs of the
projection layer are not immediately decodable into language describing image
content; instead, we find that translation between modalities occurs deeper
within the transformer. We introduce a procedure for identifying "multimodal
neurons" that convert visual representations into corresponding text, and
decoding the concepts they inject into the model's residual stream. In a series
of experiments, we show that multimodal neurons operate on specific visual
concepts across inputs, and have a systematic causal effect on image
captioning.
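
To make the setup concrete, below is a minimal sketch of the architecture the abstract describes: a frozen text transformer given vision through a single trained linear projection from a self-supervised visual encoder into the language model's input embedding space. Dimensions, class names, and the patch count are hypothetical illustrations, not the authors' exact configuration.

import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
D_VISION, D_MODEL = 1024, 4096

class VisualPrefix(nn.Module):
    """Single linear projection mapping frozen visual-encoder patch
    features into the frozen language model's embedding space.
    Only this layer is trained, on an image-to-text task."""
    def __init__(self, d_vision: int = D_VISION, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (n_patches, d_vision) from the vision encoder.
        # Returns (n_patches, d_model) "soft prompts" prepended to the
        # text tokens consumed by the frozen transformer.
        return self.proj(patch_features)

prefix = VisualPrefix()
soft_prompts = prefix(torch.randn(196, D_VISION))  # e.g., a 14x14 patch grid
print(soft_prompts.shape)  # torch.Size([196, 4096])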
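
The abstract also mentions decoding the concepts a neuron injects into the residual stream. One common way to do this, sketched here under the assumption of a GPT-style decoder, is to project a neuron's output weight vector through the model's unembedding matrix and read off the tokens it promotes; the tensors below are random stand-ins, and decode_neuron is a hypothetical helper, not the paper's exact procedure.

import torch

# Hypothetical shapes. w_out_row is one MLP neuron's output direction:
# the (d_model,) vector that the neuron's activation scales before it is
# added to the residual stream.
d_model, vocab_size = 4096, 50400
w_out_row = torch.randn(d_model)
W_U = torch.randn(d_model, vocab_size)  # the LM's unembedding matrix

def decode_neuron(w_out_row: torch.Tensor, W_U: torch.Tensor, k: int = 10):
    """Project a neuron's residual-stream contribution through the
    unembedding matrix and return the top-k promoted token ids."""
    logits = w_out_row @ W_U            # (vocab_size,)
    return torch.topk(logits, k).indices.tolist()

print(decode_neuron(w_out_row, W_U))

With a real checkpoint, mapping the returned token ids back through the tokenizer yields an interpretable word list for each candidate multimodal neuron.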