From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
October 9, 2024
Authors: Yang Bai, Yang Zhou, Jun Zhou, Rick Siow Mong Goh, Daniel Shu Wei Ting, Yong Liu
cs.AI
Abstract
Large vision language models (VLMs) combine large language models with vision
encoders, demonstrating promise across various tasks. However, they often
underperform in task-specific applications due to domain gaps between
pre-training and fine-tuning. We introduce VITask, a novel framework that
enhances task-specific adaptability of VLMs by integrating task-specific models
(TSMs). VITask employs three key strategies: exemplar prompting (EP), response
distribution alignment (RDA), and contrastive response tuning (CRT) to improve
the task-specific performance of VLMs by adjusting their response
distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to
adapt without TSMs during inference by learning from exemplar-prompted models.
CRT further optimizes the ranking of correct image-response pairs, thereby
reducing the risk of generating undesired responses. Experiments on 12 medical
diagnosis datasets across 9 imaging modalities show that VITask outperforms
both vanilla instruction-tuned VLMs and TSMs, showcasing its ability to
integrate complementary features from both models effectively. Additionally,
VITask offers practical advantages such as flexible TSM integration and
robustness to incomplete instructions, making it a versatile and efficient
solution for task-specific VLM tuning. Our code is available at
https://github.com/baiyang4/VITask.
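For intuition, below is a minimal, hypothetical sketch of the three objectives named in the abstract: exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT). The tiny linear modules, feature shapes, and loss weights are stand-ins chosen for illustration, not the paper's actual architecture or implementation.

```python
# Hypothetical sketch of the VITask objectives (EP, RDA, CRT) on toy tensors.
# Small linear layers stand in for a real VLM head and a frozen task-specific
# model (TSM); all names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, V = 32, 100                           # toy hidden size and vocabulary size

tsm_encoder = torch.nn.Linear(D, D)      # stand-in for a frozen TSM feature extractor
vlm_head = torch.nn.Linear(2 * D, V)     # stand-in for the VLM's response head

def vlm_logits(image_feat, exemplar):
    """Next-token logits, optionally conditioned on a TSM exemplar (EP)."""
    ex = exemplar if exemplar is not None else torch.zeros_like(image_feat)
    return vlm_head(torch.cat([image_feat, ex], dim=-1))

image = torch.randn(4, D)                # toy batch of image features
target = torch.randint(0, V, (4,))       # correct response tokens
negative = torch.randint(0, V, (4,))     # undesired response tokens

# 1) Exemplar prompting (EP): TSM features guide the VLM's response distribution.
exemplar = tsm_encoder(image).detach()
logits_ep = vlm_logits(image, exemplar)
loss_ep = F.cross_entropy(logits_ep, target)

# 2) Response distribution alignment (RDA): the plain VLM (no TSM needed at
#    inference) learns to match the exemplar-prompted distribution via a KL term.
logits_plain = vlm_logits(image, None)
loss_rda = F.kl_div(F.log_softmax(logits_plain, dim=-1),
                    F.softmax(logits_ep, dim=-1).detach(),
                    reduction="batchmean")

# 3) Contrastive response tuning (CRT): rank the correct image-response pair
#    above an undesired one with a margin-style loss.
log_p = F.log_softmax(logits_plain, dim=-1)
pos = log_p.gather(1, target.unsqueeze(1)).squeeze(1)
neg = log_p.gather(1, negative.unsqueeze(1)).squeeze(1)
loss_crt = F.relu(1.0 + neg - pos).mean()

loss = loss_ep + loss_rda + loss_crt     # equal weights here, purely illustrative
loss.backward()
print(float(loss))
```

The sketch only conveys how the three terms interact: EP trains with TSM guidance, RDA distills that guidance into the TSM-free path used at inference, and CRT pushes correct responses above undesired ones; the actual formulation is in the paper and repository linked above.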