

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

October 9, 2024
作者: Yang Bai, Yang Zhou, Jun Zhou, Rick Siow Mong Goh, Daniel Shu Wei Ting, Yong Liu
cs.AI

Abstract

Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks. However, they often underperform in task-specific applications due to domain gaps between pre-training and fine-tuning. We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs). VITask employs three key strategies: exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT) to improve the task-specific performance of VLMs by adjusting their response distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to adapt without TSMs during inference by learning from exemplar-prompted models. CRT further optimizes the ranking of correct image-response pairs, thereby reducing the risk of generating undesired responses. Experiments on 12 medical diagnosis datasets across 9 imaging modalities show that VITask outperforms both vanilla instruction-tuned VLMs and TSMs, showcasing its ability to integrate complementary features from both models effectively. Additionally, VITask offers practical advantages such as flexible TSM integration and robustness to incomplete instructions, making it a versatile and efficient solution for task-specific VLM tuning. Our code is available at https://github.com/baiyang4/VITask.
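The abstract does not give formulas, but the idea behind contrastive response tuning (ranking the correct image-response pair above undesired ones) can be illustrated with a generic margin-based ranking loss. The sketch below is an assumption for illustration only: the function name, margin form, and values are not from the paper, which may use a different objective.

```python
import math


def contrastive_response_loss(logp_correct: float,
                              logp_undesired: float,
                              margin: float = 1.0) -> float:
    """Illustrative margin ranking loss (NOT the paper's exact objective).

    Encourages the log-likelihood of the correct response to exceed
    that of an undesired response by at least `margin`; zero loss once
    the desired ranking gap is achieved.
    """
    gap = logp_correct - logp_undesired
    return max(0.0, margin - gap)


# Correct response already ranked well above the undesired one: no loss.
print(contrastive_response_loss(-1.0, -5.0))  # 0.0

# Undesired response ranked higher: positive loss pushes the ranking apart.
print(contrastive_response_loss(-2.0, -1.5))  # 1.5
```

In practice such a term would be averaged over sampled image-response pairs and added to the instruction-tuning objective; here it only serves to make the ranking intuition concrete.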
