CoLLaVO: Crayon Large Language and Vision mOdel
February 17, 2024
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
cs.AI
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning
has driven the evolution of Vision Language Models (VLMs) toward versatile
general-purpose models. Yet it remains unexplored whether current VLMs
genuinely possess quality object-level image understanding, as probed by
questions such as 'what objects are in the image?' or 'which object
corresponds to a specified bounding box?'. Our findings reveal that the image
understanding capabilities of current VLMs are strongly correlated with their
zero-shot performance on Vision Language (VL) tasks. This suggests that
prioritizing basic image understanding is crucial for VLMs to excel at VL
tasks. To enhance object-level image understanding, we propose the Crayon
Large Language and Vision mOdel (CoLLaVO), which incorporates instruction
tuning with the crayon prompt, a new visual prompt tuning scheme based on
panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy
that preserves object-level image understanding without forgetting it during
visual instruction tuning, thereby achieving a significant leap on numerous
zero-shot VL benchmarks.
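The abstract does not spell out how a crayon prompt is injected, but the core idea it describes — coloring each object region via a panoptic map and feeding that signal alongside the image — can be sketched as follows. This is a hypothetical, dependency-free illustration, not the paper's implementation: the function name `crayon_prompt`, the `palette` of per-class "color" vectors, and the simple per-patch mean pooling are all assumptions made for clarity.

```python
def crayon_prompt(patch_emb, panoptic, palette, patch):
    """Illustrative sketch of a crayon-prompt step (not CoLLaVO's actual code).

    Each panoptic class id is assigned a 'crayon color' vector from `palette`.
    For every image patch, the colors of its pixels are averaged and added to
    that patch's embedding, so object identity is painted into the visual tokens.

    patch_emb : list of patch embedding vectors, in row-major patch order
    panoptic  : H x W grid of integer class ids (the panoptic color map)
    palette   : list mapping class id -> color vector of the same dim as patch_emb
    patch     : side length of a square patch in pixels
    """
    H, W = len(panoptic), len(panoptic[0])
    D = len(palette[0])
    out, idx = [], 0
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            # Mean palette vector over this patch's pixels.
            acc = [0.0] * D
            for di in range(patch):
                for dj in range(patch):
                    color = palette[panoptic[i + di][j + dj]]
                    for d in range(D):
                        acc[d] += color[d]
            n = patch * patch
            out.append([patch_emb[idx][d] + acc[d] / n for d in range(D)])
            idx += 1
    return out
```

In a real VLM the palette entries would be learnable embeddings and the pooling would run on GPU tensors; the sketch only shows where the panoptic map enters the pipeline. A patch covering a single object gets that object's pure color, while a patch straddling two objects gets a blend.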