Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
April 19, 2024
作者: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
cs.AI
Abstract
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded
and fine-grained visual perception ability. Beyond holistic image
understanding, Groma is adept at region-level tasks such as region captioning
and visual grounding. Such capabilities are built upon a localized visual
tokenization mechanism, where an image input is decomposed into regions of
interest and subsequently encoded into region tokens. By integrating region
tokens into user instructions and model responses, we seamlessly enable Groma
to understand user-specified region inputs and ground its textual output to
images. In addition, to enhance the grounded chat ability of Groma, we curate a
visually grounded instruction dataset by leveraging the powerful GPT-4V and
visual prompting techniques. Compared with MLLMs that rely on the language
model or an external module for localization, Groma consistently demonstrates
superior performance on standard referring and grounding benchmarks,
highlighting the advantages of embedding localization into image tokenization.
Project page: https://groma-mllm.github.io/.
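As a rough illustration of the localized visual tokenization idea described in the abstract, the sketch below shows how proposed image regions might be encoded into region tokens and referenced from a text prompt. All names here (`RegionEncoder`, `build_grounded_prompt`, the `<r0>`-style placeholders, and the feature dimensions) are hypothetical assumptions for illustration and are not taken from the official Groma codebase.

```python
# Minimal sketch of localized visual tokenization, under assumed names/shapes.
# Not the authors' implementation; it only illustrates how region tokens could
# be produced and interleaved with text so the LLM can refer back to regions.
import torch
import torch.nn as nn


class RegionEncoder(nn.Module):
    """Projects pooled region-of-interest features into region tokens (hypothetical)."""

    def __init__(self, feat_dim: int = 1024, token_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, token_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, feat_dim) pooled features, one row per region
        return self.proj(region_feats)  # (num_regions, token_dim) region tokens


def build_grounded_prompt(text: str, region_ids: list[int]) -> str:
    """Interleave placeholder region references (<r0>, <r1>, ...) into the user text.

    The language model can then mention these placeholders in its output to
    ground generated phrases to specific image regions.
    """
    refs = " ".join(f"<r{i}>" for i in region_ids)
    return f"{text} {refs}"


if __name__ == "__main__":
    encoder = RegionEncoder()
    fake_region_feats = torch.randn(3, 1024)    # features for 3 proposed regions
    region_tokens = encoder(fake_region_feats)  # (3, 4096) tokens fed to the LLM
    prompt = build_grounded_prompt("Describe the object in region", [0])
    print(prompt, region_tokens.shape)
```

In this sketch the region tokens would be inserted into the LLM input sequence at the positions of the corresponding placeholders, which mirrors, at a high level, how the abstract describes integrating region tokens into user instructions and model responses.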