GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
July 7, 2023
Authors: Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo
cs.AI
Abstract
Instruction tuning a large language model (LLM) on image-text pairs has
achieved unprecedented vision-language multimodal abilities. However, the
resulting vision-language alignment is built only at the image level; the lack
of region-level alignment limits progress toward fine-grained multimodal
understanding. In this paper, we propose instruction tuning on
region-of-interest. The key design is to reformulate a bounding box into the
format of a spatial instruction. The interleaved sequence of the visual
features extracted by the spatial instruction and the language embeddings is
fed into the LLM, which is trained on region-text data transformed into the
instruction-tuning format. Our region-level vision-language model, termed
GPT4RoI, brings a brand-new conversational and interactive experience beyond
image-level understanding.
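To make the key design concrete, here is a minimal PyTorch sketch of the
spatial-instruction idea: RoI-aligned visual features for a bounding box are
projected to the LLM embedding width and spliced into the language embedding
sequence at a placeholder position. The module names, dimensions, and the
placeholder convention are illustrative assumptions, not the paper's actual
implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class SpatialInstruction(nn.Module):
    """Illustrative module: turn a bounding box into one 'region token'
    by RoI-aligning image features and projecting to the LLM width."""

    def __init__(self, feat_dim=256, llm_dim=4096, pool=7):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(feat_dim * pool * pool, llm_dim)

    def forward(self, feat_map, boxes):
        # feat_map: (1, C, H, W) image features; boxes: (N, 4) xyxy in
        # feature-map coordinates. Prepend the batch index roi_align expects.
        rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)
        pooled = roi_align(feat_map, rois, output_size=self.pool)  # (N, C, p, p)
        return self.proj(pooled.flatten(1))                        # (N, llm_dim)

def interleave(text_embeds, slot_positions, region_embeds):
    """Replace placeholder rows of the text embedding sequence (e.g. where a
    <region1> token sat) with region embeddings, yielding the interleaved
    sequence of visual features and language embeddings fed to the LLM."""
    out = text_embeds.clone()
    for pos, emb in zip(slot_positions, region_embeds):
        out[pos] = emb
    return out

# Toy usage with random tensors standing in for a vision backbone and tokenizer.
feat_map = torch.randn(1, 256, 32, 32)
boxes = torch.tensor([[4.0, 4.0, 12.0, 12.0]])
region_embeds = SpatialInstruction()(feat_map, boxes)   # (1, 4096)
text_embeds = torch.randn(10, 4096)                     # 10 prompt tokens
inputs = interleave(text_embeds, [3], region_embeds)    # token 3 := region
```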
(1) Controllability: Users can interact with our model through both language
and spatial instructions to flexibly adjust the level of detail of the
question. (2) Capacities: Our model supports not only single-region but also
multi-region spatial instructions. This unlocks more region-level multimodal
capacities, such as detailed region captioning and complex region reasoning.
(3) Composition: Any off-the-shelf object detector can serve as a spatial
instruction provider, so as to mine informative object attributes from our
model, such as color, shape, material, action, and relation to other objects
(a sketch of this composition appears below). The code, data, and demo can be
found at https://github.com/jshilong/GPT4RoI.
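As a hedged illustration of the composition property, the snippet below uses
torchvision's off-the-shelf Faster R-CNN as a spatial instruction provider:
each confident detection becomes a box paired with a <region1>-style prompt
that a region-level model such as GPT4RoI could answer. The prompt template
and the final model call are assumptions; see the repository above for the
real interface.

```python
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

# Any off-the-shelf detector can propose regions; torchvision's Faster R-CNN
# stands in here for whatever detector is at hand.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = Image.open("example.jpg").convert("RGB")

with torch.no_grad():
    det = detector([to_tensor(image)])[0]

# Keep confident detections; each box becomes a spatial instruction paired
# with a language instruction about that region.
boxes = det["boxes"][det["scores"] > 0.7].tolist()
single_region = [
    (box, "What are the color, material, and action of <region1>?")
    for box in boxes
]
# A multi-region query that reasons over all detected boxes at once.
multi_region = (
    boxes,
    "Describe the relationship between "
    + ", ".join(f"<region{i + 1}>" for i in range(len(boxes)))
    + ".",
)
for box, prompt in single_region:
    print(box, prompt)  # in practice: answer = model.chat(image, [box], prompt)
```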