From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Abstract

Large vision language models (VLMs) combine large language models with vision encoders and show promise across a wide range of tasks. However, they often underperform in task-specific applications because of domain gaps between pre-training and fine-tuning. We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs). VITask employs three key strategies, exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT), to improve the task-specific performance of VLMs by adjusting their response distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to adapt without TSMs during inference by learning from exemplar-prompted models. CRT further optimizes the ranking of correct image-response pairs, thereby reducing the risk of generating undesired responses. Experiments on 12 medical diagnosis datasets across 9 imaging modalities show that VITask outperforms both vanilla instruction-tuned VLMs and TSMs, demonstrating its ability to integrate complementary features from both models effectively. Additionally, VITask offers practical advantages such as flexible TSM integration and robustness to incomplete instructions, making it a versatile and efficient solution for task-specific VLM tuning. Our code is available at https://github.com/baiyang4/VITask.
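The abstract names RDA and CRT but gives no formulas. As a rough illustration only, the sketch below assumes that RDA can be read as a KL divergence pulling the response distribution of the VLM without exemplars toward that of the exemplar-prompted VLM, and that CRT can be read as a margin-based ranking term that prefers a correct image-response pair over a mismatched one. The function names, the hinge form, and the `margin` parameter are hypothetical and are not taken from the VITask code.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rda_loss(logits_with_exemplar, logits_without_exemplar):
    """Assumed RDA-style term: the exemplar-prompted distribution acts as the
    target, and the distribution produced without exemplars is pulled toward it."""
    target = softmax(logits_with_exemplar)
    student = softmax(logits_without_exemplar)
    return kl_divergence(target, student)


def crt_loss(logp_correct_pair, logp_mismatched_pair, margin=1.0):
    """Assumed CRT-style term: a hinge loss that is zero once the correct
    image-response pair outscores the mismatched pair by at least `margin`."""
    return max(0.0, margin - (logp_correct_pair - logp_mismatched_pair))


if __name__ == "__main__":
    # Toy next-token logits over a three-word vocabulary (made-up numbers).
    print(rda_loss([2.0, 0.5, 0.1], [1.0, 1.0, 0.2]))
    # Toy log-likelihoods of one response given the right and the wrong image.
    print(crt_loss(-3.0, -2.5))
```

Under this reading, the RDA term is the only place the exemplar-prompted model appears, which is consistent with the abstract's statement that TSMs are not needed at inference time.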
