Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Abstract

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.
