

CoLLaVO: Crayon Large Language and Vision mOdel

February 17, 2024
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
cs.AI

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, determined by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with the crayon prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a Dual QLoRA learning strategy to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in zero-shot performance on numerous VL benchmarks.
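To make the crayon-prompt idea concrete, here is a minimal illustrative sketch (not the authors' implementation) of how a panoptic color map could be turned into a visual prompt: each panoptic segment id is painted with a flat "crayon" color and alpha-blended onto the image. The function name, palette choice, and blending scheme are all assumptions for illustration only.

```python
import numpy as np

def crayon_prompt(image: np.ndarray, panoptic_ids: np.ndarray,
                  alpha: float = 0.5, seed: int = 0) -> np.ndarray:
    """Hypothetical sketch: blend a panoptic color map onto an image,
    assigning one flat color per segment id (a 'crayon' overlay)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(panoptic_ids)
    # hypothetical palette: one random RGB color per panoptic segment
    palette = {i: rng.integers(0, 256, size=3) for i in ids}
    color_map = np.zeros_like(image)
    for i in ids:
        color_map[panoptic_ids == i] = palette[i]
    # alpha-blend the color map with the original image
    return ((1 - alpha) * image + alpha * color_map).astype(np.uint8)

# toy example: a 4x4 gray RGB image split into two panoptic segments
img = np.full((4, 4, 3), 128, dtype=np.uint8)
seg = np.zeros((4, 4), dtype=np.int64)
seg[:, 2:] = 1  # right half belongs to a second segment
prompted = crayon_prompt(img, seg)
print(prompted.shape)  # (4, 4, 3)
```

In the paper's pipeline the crayon prompt is used during instruction tuning rather than as a simple overlay at inference; this sketch only illustrates the panoptic-color-map intuition behind it.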