Task Vectors are Cross-Modal

Abstract

We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process that is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io.
