Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Abstract

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length, which is set at pretraining. The problem is especially prominent in the multimodal domain, where processing both text and images requires additional tokens. This motivates the need for a multimodal method that compresses many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV), compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
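To make the context-length constraint concrete, the following is a minimal back-of-the-envelope sketch of how many image-text shots fit into a fixed context window. The function name `max_shots` and every number used (an 8192-token window, 576 tokens per image, 64 tokens of text per shot) are illustrative assumptions and are not taken from the paper.

```python
def max_shots(context_length: int, tokens_per_image: int,
              tokens_per_text: int, query_tokens: int) -> int:
    """Largest number of in-context shots that fit alongside the query.

    All arguments are hypothetical token counts chosen for illustration.
    """
    per_shot = tokens_per_image + tokens_per_text  # cost of one example
    budget = context_length - query_tokens         # tokens left for shots
    return max(budget // per_shot, 0)              # never negative


# Assumed setup: 8192-token window; the query is one image plus text.
multimodal = max_shots(8192, tokens_per_image=576, tokens_per_text=64,
                       query_tokens=640)
text_only = max_shots(8192, tokens_per_image=0, tokens_per_text=64,
                      query_tokens=640)
print(multimodal, text_only)
```

Under these assumed numbers, the image tokens cut the number of shots that fit by roughly a factor of ten compared with text-only examples, which is the pressure that motivates compressing shots rather than placing them all in the prompt.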
