From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

October 9, 2024
Authors: Yang Bai, Yang Zhou, Jun Zhou, Rick Siow Mong Goh, Daniel Shu Wei Ting, Yong Liu
cs.AI

Abstract

Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks. However, they often underperform in task-specific applications due to domain gaps between pre-training and fine-tuning. We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs). VITask employs three key strategies: exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT) to improve the task-specific performance of VLMs by adjusting their response distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to adapt without TSMs during inference by learning from exemplar-prompted models. CRT further optimizes the ranking of correct image-response pairs, thereby reducing the risk of generating undesired responses. Experiments on 12 medical diagnosis datasets across 9 imaging modalities show that VITask outperforms both vanilla instruction-tuned VLMs and TSMs, showcasing its ability to integrate complementary features from both models effectively. Additionally, VITask offers practical advantages such as flexible TSM integration and robustness to incomplete instructions, making it a versatile and efficient solution for task-specific VLM tuning. Our code is available at https://github.com/baiyang4/VITask.
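The abstract names the three strategies but gives no implementation detail. As a rough illustration only, the PyTorch sketch below shows one plausible reading of each component; `ExemplarPrompt`, `rda_loss`, `crt_loss`, and every dimension and margin value are invented here for illustration and are not taken from the paper or its repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExemplarPrompt(nn.Module):
    """Hypothetical EP adapter: project task-specific model (TSM) features
    into the VLM embedding space so they can be prepended to the visual
    tokens as extra guidance."""

    def __init__(self, tsm_dim: int, vlm_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.vlm_dim = vlm_dim
        self.proj = nn.Linear(tsm_dim, num_tokens * vlm_dim)

    def forward(self, tsm_feat: torch.Tensor) -> torch.Tensor:
        # (batch, tsm_dim) -> (batch, num_tokens, vlm_dim)
        return self.proj(tsm_feat).view(-1, self.num_tokens, self.vlm_dim)


def rda_loss(logits_plain: torch.Tensor,
             logits_exemplar: torch.Tensor) -> torch.Tensor:
    """Hypothetical RDA objective: make the plain (TSM-free) model mimic
    the response distribution of the exemplar-prompted model, so the TSM
    can be dropped at inference time."""
    return F.kl_div(
        F.log_softmax(logits_plain, dim=-1),
        F.softmax(logits_exemplar.detach(), dim=-1),  # teacher is frozen
        reduction="batchmean",
    )


def crt_loss(logp_correct: torch.Tensor,
             logp_incorrect: torch.Tensor,
             margin: float = 1.0) -> torch.Tensor:
    """Hypothetical CRT objective: rank the correct image-response pair
    above an incorrect one by at least `margin` in log-likelihood."""
    return F.relu(margin - (logp_correct - logp_incorrect)).mean()
```

In this reading, EP conditions the VLM on TSM features during tuning, RDA distills the exemplar-prompted response distribution into the plain model, and CRT adds a margin-ranking term over correct versus incorrect responses, matching the abstract's description of adjusting response distributions.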
