From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Abstract

Large vision language models (VLMs) combine large language models with vision encoders and show promise across a wide range of tasks. However, they often underperform in task-specific applications because of the domain gap between pre-training and fine-tuning. We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs). VITask employs three key strategies, exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT), to improve the task-specific performance of VLMs by adjusting their response distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to adapt without TSMs during inference by learning from exemplar-prompted models. CRT further optimizes the ranking of correct image-response pairs, thereby reducing the risk of generating undesired responses. Experiments on 12 medical diagnosis datasets across 9 imaging modalities show that VITask outperforms both vanilla instruction-tuned VLMs and TSMs, demonstrating that it effectively integrates complementary features from both models. Additionally, VITask offers practical advantages such as flexible TSM integration and robustness to incomplete instructions, making it a versatile and efficient solution for task-specific VLM tuning. Our code is available at https://github.com/baiyang4/VITask.
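The abstract does not give formulas for RDA or CRT, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes RDA can be approximated by a per-token KL divergence that pulls the response distribution of the model without exemplars toward that of the exemplar-prompted model, and that CRT can be approximated by a hinge-style margin between the log-likelihood of the correct response given the matching image and given a mismatched image. The function names, the `margin` parameter, and the hinge form are all hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def rda_loss(logits_with_exemplar, logits_without_exemplar):
    """Assumed RDA-style loss: mean per-token KL(teacher || student).

    Each argument is a list of per-token logit vectors. The exemplar-prompted
    pass plays the teacher; the pass without exemplars plays the student.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(logits_with_exemplar, logits_without_exemplar)
    ]
    return sum(per_token) / len(per_token)

def crt_loss(logp_matched, logp_mismatched, margin=1.0):
    """Assumed CRT-style loss: hinge on the log-likelihood gap.

    logp_matched is log p(response | correct image, instruction);
    logp_mismatched is log p(response | wrong image, instruction).
    The loss is zero once the gap reaches the (hypothetical) margin.
    """
    return max(0.0, margin - (logp_matched - logp_mismatched))

if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    student = [[1.0, 1.0, -0.5], [0.0, 0.8, 0.9]]
    print("RDA-style loss:", rda_loss(teacher, student))
    print("CRT-style loss:", crt_loss(-3.0, -2.5))
```

Under these assumptions, the alignment term vanishes when the two passes produce identical distributions, which matches the stated goal of letting the VLM work without TSMs at inference, and the contrastive term is active only while a mismatched image scores nearly as well as the correct one.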
