CoLLaVO: Crayon Large Language and Vision mOdel
February 17, 2024
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
cs.AI
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning
has driven the evolution of Vision Language Models (VLMs) toward versatile
general-purpose models. Yet it remains unexplored whether current VLMs
genuinely possess quality object-level image understanding, as probed by
questions such as 'what objects are in the image?' or 'which object
corresponds to a specified bounding box?'. Our findings reveal that the image
understanding capabilities of current VLMs are strongly correlated with their
zero-shot performance on Vision Language (VL) tasks. This suggests that
prioritizing basic image understanding is crucial for VLMs to excel at VL
tasks. To enhance object-level image understanding, we propose the Crayon
Large Language and Vision mOdel (CoLLaVO), which incorporates instruction
tuning with the crayon prompt, a new visual prompt tuning scheme based on
panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy
that preserves object-level image understanding without forgetting it during
visual instruction tuning, thereby achieving a significant leap on numerous
zero-shot VL benchmarks.
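The abstract does not spell out how a crayon prompt is injected, but the core idea it describes — coloring each object region via a panoptic map and feeding that signal alongside the image — can be sketched as follows. This is a hypothetical, dependency-free illustration, not the paper's implementation: the function name `crayon_prompt`, the `palette` of per-class "color" vectors, and the simple per-patch mean pooling are all assumptions made for clarity.

```python
def crayon_prompt(patch_emb, panoptic, palette, patch):
    """Illustrative sketch of a crayon-prompt step (not CoLLaVO's actual code).

    Each panoptic class id is assigned a 'crayon color' vector from `palette`.
    For every image patch, the colors of its pixels are averaged and added to
    that patch's embedding, so object identity is painted into the visual tokens.

    patch_emb : list of patch embedding vectors, in row-major patch order
    panoptic  : H x W grid of integer class ids (the panoptic color map)
    palette   : list mapping class id -> color vector of the same dim as patch_emb
    patch     : side length of a square patch in pixels
    """
    H, W = len(panoptic), len(panoptic[0])
    D = len(palette[0])
    out, idx = [], 0
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            # Mean palette vector over this patch's pixels.
            acc = [0.0] * D
            for di in range(patch):
                for dj in range(patch):
                    color = palette[panoptic[i + di][j + dj]]
                    for d in range(D):
                        acc[d] += color[d]
            n = patch * patch
            out.append([patch_emb[idx][d] + acc[d] / n for d in range(D)])
            idx += 1
    return out
```

In a real VLM the palette entries would be learnable embeddings and the pooling would run on GPU tensors; the sketch only shows where the panoptic map enters the pipeline. A patch covering a single object gets that object's pure color, while a patch straddling two objects gets a blend.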