List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
April 25, 2024
Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang
cs.AI
Abstract
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite GPT-4V's extraordinary performance, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image in the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k tagged images), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.
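To make the paradigm concrete, the following is a minimal sketch of how a "list items one by one" instruction-tuning pair might be assembled from SoM-style tag annotations. The field names (`tag_id`, `description`), the prompt wording, and the helper `build_listing_sample` are illustrative assumptions for this sketch, not the exact data format released with SoM-LLaVA.

```python
# Hypothetical sketch: turn a SoM-tagged image into an enumeration-style
# training pair that asks the model to list every tagged object in order.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaggedObject:
    tag_id: int          # numeric mark overlaid on the image (assumed field)
    description: str     # short caption of the object under that mark (assumed field)


def build_listing_sample(image_path: str, objects: List[TaggedObject]) -> Dict[str, str]:
    """Build an instruction/response pair that enumerates all visual tags
    in ascending order of their numeric IDs."""
    ordered = sorted(objects, key=lambda o: o.tag_id)
    instruction = (
        "Each visual object in the image is labeled with a numeric ID at its center. "
        "Please list all objects one by one, following the order of their tags."
    )
    response = "\n".join(f"{o.tag_id}. {o.description}" for o in ordered)
    return {"image": image_path, "instruction": instruction, "response": response}


if __name__ == "__main__":
    sample = build_listing_sample(
        "images/example.jpg",
        [
            TaggedObject(2, "a red umbrella on the beach"),
            TaggedObject(1, "a person walking a dog"),
        ],
    )
    print(sample["response"])
    # 1. a person walking a dog
    # 2. a red umbrella on the beach
```

Samples of this form would then be mixed with standard visual instruction tuning data, so that the model learns to ground each alphanumeric tag to its corresponding object while retaining its general instruction-following ability.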