Multimodal Neurons in Pretrained Text-Only Transformers

Abstract

Language models demonstrate a remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and for decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs and have a systematic causal effect on image captioning.
