Multimodal Neurons in Pretrained Text-Only Transformers
August 3, 2023
Authors: Sarah Schwettmann, Neil Chowdhury, Antonio Torralba
cs.AI
Abstract
Language models demonstrate remarkable capacity to generalize representations
learned in one modality to downstream tasks in other modalities. Can we trace
this ability to individual neurons? We study the case where a frozen text
transformer is augmented with vision using a self-supervised visual encoder and
a single linear projection learned on an image-to-text task. Outputs of the
projection layer are not immediately decodable into language describing image
content; instead, we find that translation between modalities occurs deeper
within the transformer. We introduce a procedure for identifying "multimodal
neurons" that convert visual representations into corresponding text, and
decoding the concepts they inject into the model's residual stream. In a series
of experiments, we show that multimodal neurons operate on specific visual
concepts across inputs, and have a systematic causal effect on image
captioning.
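
To make the setup concrete, below is a minimal sketch of the architecture the abstract describes: a frozen text transformer given vision through a single trained linear projection from a self-supervised visual encoder into the language model's input embedding space. Dimensions, class names, and the patch count are hypothetical illustrations, not the authors' exact configuration.

import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
D_VISION, D_MODEL = 1024, 4096

class VisualPrefix(nn.Module):
    """Single linear projection mapping frozen visual-encoder patch
    features into the frozen language model's embedding space.
    Only this layer is trained, on an image-to-text task."""
    def __init__(self, d_vision: int = D_VISION, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (n_patches, d_vision) from the vision encoder.
        # Returns (n_patches, d_model) "soft prompts" prepended to the
        # text tokens consumed by the frozen transformer.
        return self.proj(patch_features)

prefix = VisualPrefix()
soft_prompts = prefix(torch.randn(196, D_VISION))  # e.g., a 14x14 patch grid
print(soft_prompts.shape)  # torch.Size([196, 4096])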
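
The abstract also mentions decoding the concepts a neuron injects into the residual stream. One common way to do this, sketched here under the assumption of a GPT-style decoder, is to project a neuron's output weight vector through the model's unembedding matrix and read off the tokens it promotes; the tensors below are random stand-ins, and decode_neuron is a hypothetical helper, not the paper's exact procedure.

import torch

# Hypothetical shapes. w_out_row is one MLP neuron's output direction:
# the (d_model,) vector that the neuron's activation scales before it is
# added to the residual stream.
d_model, vocab_size = 4096, 50400
w_out_row = torch.randn(d_model)
W_U = torch.randn(d_model, vocab_size)  # the LM's unembedding matrix

def decode_neuron(w_out_row: torch.Tensor, W_U: torch.Tensor, k: int = 10):
    """Project a neuron's residual-stream contribution through the
    unembedding matrix and return the top-k promoted token ids."""
    logits = w_out_row @ W_U            # (vocab_size,)
    return torch.topk(logits, k).indices.tolist()

print(decode_neuron(w_out_row, W_U))

With a real checkpoint, mapping the returned token ids back through the tokenizer yields an interpretable word list for each candidate multimodal neuron.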