Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
April 19, 2024
Authors: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
cs.AI
Abstract
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded
and fine-grained visual perception ability. Beyond holistic image
understanding, Groma is adept at region-level tasks such as region captioning
and visual grounding. Such capabilities are built upon a localized visual
tokenization mechanism, where an image input is decomposed into regions of
interest and subsequently encoded into region tokens. By integrating region
tokens into user instructions and model responses, we seamlessly enable Groma
to understand user-specified region inputs and ground its textual output to
images. In addition, to enhance Groma's grounded chat ability, we curate a
visually grounded instruction dataset by leveraging the powerful GPT-4V and
visual prompting techniques. Compared with MLLMs that rely on the language
model or external modules for localization, Groma consistently demonstrates
superior performance on standard referring and grounding benchmarks,
highlighting the advantages of embedding localization into image tokenization.
Project page: https://groma-mllm.github.io/.
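The abstract describes localized visual tokenization: an image is decomposed into regions of interest, each region is encoded as a "region token", and those tokens are spliced into the instruction and response sequences seen by the language model. The sketch below illustrates that idea only at a conceptual level; it is not the authors' implementation, and the module names, placeholder markers, and dimensions (RegionEncoder, build_multimodal_input, "<roi>", 1024/4096) are illustrative assumptions.

```python
# Conceptual sketch (not Groma's actual code): splicing per-region visual
# embeddings ("region tokens") into an LLM's input embedding sequence.
# The region proposer and vision backbone are stubbed with random tensors.
import torch
import torch.nn as nn


class RegionEncoder(nn.Module):
    """Projects a pooled region feature into the LLM embedding space (hypothetical)."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, vis_dim), one pooled feature per proposed box
        return self.proj(region_feats)  # (num_regions, llm_dim) "region tokens"


def build_multimodal_input(text_embeds: torch.Tensor,
                           region_tokens: torch.Tensor,
                           region_slots: list[int]) -> torch.Tensor:
    """Replace placeholder positions in the text embeddings with region tokens.

    text_embeds:   (seq_len, llm_dim) embeddings of the tokenized instruction,
                   where each index in `region_slots` corresponds to a literal
                   "<roi>" placeholder in the prompt.
    region_tokens: (num_regions, llm_dim), one token per region of interest.
    """
    out = text_embeds.clone()
    for slot, tok in zip(region_slots, region_tokens):
        out[slot] = tok  # the language model now attends to this region here
    return out


if __name__ == "__main__":
    llm_dim = 4096
    # Pretend a region proposer returned 2 boxes and the vision backbone pooled
    # a 1024-d feature for each.
    pooled_region_feats = torch.randn(2, 1024)
    region_tokens = RegionEncoder(vis_dim=1024, llm_dim=llm_dim)(pooled_region_feats)

    # Instruction such as "Describe <roi> and compare it with <roi>." tokenized
    # into 10 text positions, with the two <roi> placeholders at indices 2 and 7.
    text_embeds = torch.randn(10, llm_dim)
    fused = build_multimodal_input(text_embeds, region_tokens, region_slots=[2, 7])
    print(fused.shape)  # torch.Size([10, 4096]) -> fed to the language model
```

In this reading, grounding the model's output works in the reverse direction: because regions are first-class tokens, the model can refer back to them in its generated text, which is what distinguishes this design from delegating localization to the language model itself or to an external module.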