List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Abstract

Set-of-Mark (SoM) prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance of GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm, "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image, following the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, one that strengthens object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.
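As an illustration of the "list items one by one" task format described in the abstract, the following is a minimal sketch of how a training target might be assembled from per-tag descriptions. The function names, the prompt wording, and the "tag. description" output layout are all hypothetical choices made for this example; they are not taken from the SoM-LLaVA code or data, which may format samples differently.

```python
from typing import Dict, Tuple


def tag_sort_key(tag: str) -> Tuple[int, int, str]:
    """Order tags alphanumerically: numeric tags by value, then other tags as strings."""
    if tag.isdigit():
        # Compare numeric tags as integers so that "2" precedes "10".
        return (0, int(tag), "")
    return (1, 0, tag)


def build_listing_sample(tag_descriptions: Dict[str, str]) -> Dict[str, str]:
    """Build one (prompt, response) pair that enumerates every tag in order.

    `tag_descriptions` maps a visual tag drawn on the image (e.g. "1", "2")
    to a short description of the tagged object. Hypothetical data layout.
    """
    ordered_tags = sorted(tag_descriptions, key=tag_sort_key)
    lines = [f"{tag}. {tag_descriptions[tag]}" for tag in ordered_tags]
    return {
        # Assumed instruction text, used here only for illustration.
        "prompt": "Please list the items one by one in the image.",
        "response": "\n".join(lines),
    }


sample = build_listing_sample(
    {
        "10": "a red umbrella",
        "2": "a dog on a leash",
        "1": "a woman in a blue coat",
    }
)
print(sample["response"])
```

The sort key treats numeric tags as integers so the listing follows tag order rather than plain string order; under this sketch, every tag on the image appears exactly once in the response.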
