List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
April 25, 2024
Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang
cs.AI
Abstract
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of
GPT-4V, by enabling the model to associate visual objects with tags inserted on
the image. These tags, marked with alphanumerics, can be indexed via text
tokens for easy reference. Despite the extraordinary performance of GPT-4V,
we observe that other Multimodal Large Language Models (MLLMs) struggle to
understand these visual tags. To promote the learning of SoM prompting for
open-source models, we propose a new learning paradigm: "list items one by
one," which asks the model to enumerate and describe all visual tags placed on
the image, following the alphanumeric order of the tags. By integrating our curated
dataset with other visual instruction tuning datasets, we are able to equip
existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our
finetuned SoM models on five MLLM benchmarks. We find that this new dataset,
even at a relatively small scale (10k-30k tagged images), significantly
enhances visual reasoning capabilities and reduces hallucinations for MLLMs.
Perhaps surprisingly, these improvements persist even when the visual tags are
omitted from input images during inference. This suggests the potential of
"list items one by one" as a new paradigm for training MLLMs, which strengthens
the object-text alignment through the use of visual tags in the training stage.
Finally, we conduct analyses by probing trained models to understand the
working mechanism of SoM. Our code and data are available at
https://github.com/zzxslp/SoM-LLaVA.
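
To make the "list items one by one" paradigm concrete, below is a minimal sketch of how one training sample might be assembled in a LLaVA-style conversation format: the model is shown an image that already carries numeric SoM tags and is asked to enumerate and describe the tagged objects in ascending tag order. The JSON field names and the build_listing_sample helper are illustrative assumptions, not the authors' released data pipeline; see the repository above for the actual data.

```python
# Illustrative sketch (not the authors' code) of building one
# "list items one by one" instruction-tuning record.
import json


def build_listing_sample(image_id: str, tagged_objects: dict[int, str]) -> dict:
    """Create one conversation record that asks the model to list every
    numeric tag drawn on the image, in ascending tag order, and to
    describe the object each tag points to."""
    # Enumerate tags in ascending (alphanumeric) order, one item per line.
    listing = "\n".join(
        f"{tag}. {description}"
        for tag, description in sorted(tagged_objects.items())
    )
    return {
        "id": image_id,
        "image": f"{image_id}.jpg",  # image with SoM tags already overlaid
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nPlease list the items in the image one by "
                         "one, following the order of the numeric tags.",
            },
            {"from": "gpt", "value": listing},
        ],
    }


if __name__ == "__main__":
    # Hypothetical tagged image: tag id -> short object description.
    sample = build_listing_sample(
        "000001",
        {
            1: "a brown dog lying on the grass",
            2: "a red frisbee",
            3: "a wooden fence",
        },
    )
    print(json.dumps(sample, indent=2, ensure_ascii=False))
```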