CoLLaVO: Crayon Large Language and Vision mOdel

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning is driving the evolution of Vision Language Models (VLMs) toward versatile general-purpose models. Yet it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, as assessed by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding so that it is not forgotten during visual instruction tuning, thereby achieving a significant leap in zero-shot performance on numerous VL benchmarks.
