
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

July 7, 2023
Authors: Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo
cs.AI

Abstract

Instruction tuning large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignment is built only at the image level; the lack of region-level alignment limits their progress toward fine-grained multimodal understanding. In this paper, we propose instruction tuning on regions of interest. The key design is to reformulate the bounding box as a spatial instruction. The interleaved sequence of visual features extracted by the spatial instruction and language embeddings is fed to the LLM, which is trained on region-text data transformed into instruction-tuning format. Our region-level vision-language model, termed GPT4RoI, brings a brand-new conversational and interactive experience beyond image-level understanding. (1) Controllability: users can interact with our model through both language and spatial instructions, flexibly adjusting the level of detail of a question. (2) Capacities: our model supports not only single-region but also multi-region spatial instructions, unlocking region-level multimodal capacities such as detailed region captioning and complex region reasoning. (3) Composition: any off-the-shelf object detector can serve as a spatial instruction provider, mining informative object attributes from our model, such as color, shape, material, action, and relations to other objects. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.
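To make the key design concrete, below is a minimal sketch (not the authors' implementation) of how a bounding box can act as a spatial instruction: the box is pooled into a region feature via RoIAlign, and that feature fills the embedding slot of a placeholder token in the prompt, yielding the interleaved sequence the abstract describes. The token id, hidden width, and helper names (`SPATIAL_TOKEN_ID`, `build_llm_inputs`) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the "spatial instruction" idea, assuming a CNN/ViT
# feature map and an LLM that accepts inputs_embeds. All names and sizes
# here are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

HIDDEN = 4096             # assumed LLM embedding width
SPATIAL_TOKEN_ID = 32001  # assumed id of a "<region>" placeholder token

def build_llm_inputs(token_ids, token_embedder, proj, feature_map, boxes):
    """Interleave language embeddings with RoI features.

    token_ids:   (seq_len,) prompt ids containing one placeholder per box
    proj:        nn.Linear mapping visual channels to the LLM width
    feature_map: (1, C, H, W) image features from a vision backbone
    boxes:       (num_boxes, 4) xyxy boxes in feature-map coordinates
    """
    # Pool one feature vector per box (7x7 RoIAlign, then spatial average).
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch idx
    region_feats = roi_align(feature_map, rois, output_size=(7, 7))
    region_feats = proj(region_feats.mean(dim=(2, 3)))            # (num_boxes, HIDDEN)

    # Embed the language tokens, then overwrite each placeholder slot with
    # the corresponding region feature -> interleaved visual/language sequence.
    embeds = token_embedder(token_ids).clone()                    # (seq_len, HIDDEN)
    slots = (token_ids == SPATIAL_TOKEN_ID).nonzero(as_tuple=True)[0]
    embeds[slots] = region_feats[: len(slots)]
    return embeds  # fed to the LLM as inputs_embeds
```

In this sketch, `boxes` could equally come from an off-the-shelf object detector, which mirrors the composition property above: detected boxes are simply converted into spatial instructions before being interleaved with the prompt.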