Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
April 19, 2024
作者: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
cs.AI
Abstract
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded
and fine-grained visual perception ability. Beyond holistic image
understanding, Groma is adept at region-level tasks such as region captioning
and visual grounding. Such capabilities are built upon a localized visual
tokenization mechanism, where an image input is decomposed into regions of
interest and subsequently encoded into region tokens. By integrating region
tokens into user instructions and model responses, we seamlessly enable Groma
to understand user-specified region inputs and ground its textual output to
images. In addition, to enhance the grounded chat ability of Groma, we curate a
visually grounded instruction dataset by leveraging the powerful GPT-4V and
visual prompting techniques. Compared with MLLMs that rely on the language
model or an external module for localization, Groma consistently demonstrates
superior performance on standard referring and grounding benchmarks,
highlighting the advantages of embedding localization into image tokenization.
Project page: https://groma-mllm.github.io/.
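As a rough illustration of the localized visual tokenization idea described in the abstract, the sketch below shows how proposed image regions might be encoded into region tokens and referenced from a text prompt. All names here (`RegionEncoder`, `build_grounded_prompt`, the `<r0>`-style placeholders, and the feature dimensions) are hypothetical assumptions for illustration and are not taken from the official Groma codebase.

```python
# Minimal sketch of localized visual tokenization, under assumed names/shapes.
# Not the authors' implementation; it only illustrates how region tokens could
# be produced and interleaved with text so the LLM can refer back to regions.
import torch
import torch.nn as nn


class RegionEncoder(nn.Module):
    """Projects pooled region-of-interest features into region tokens (hypothetical)."""

    def __init__(self, feat_dim: int = 1024, token_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, token_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, feat_dim) pooled features, one row per region
        return self.proj(region_feats)  # (num_regions, token_dim) region tokens


def build_grounded_prompt(text: str, region_ids: list[int]) -> str:
    """Interleave placeholder region references (<r0>, <r1>, ...) into the user text.

    The language model can then mention these placeholders in its output to
    ground generated phrases to specific image regions.
    """
    refs = " ".join(f"<r{i}>" for i in region_ids)
    return f"{text} {refs}"


if __name__ == "__main__":
    encoder = RegionEncoder()
    fake_region_feats = torch.randn(3, 1024)    # features for 3 proposed regions
    region_tokens = encoder(fake_region_feats)  # (3, 4096) tokens fed to the LLM
    prompt = build_grounded_prompt("Describe the object in region", [0])
    print(prompt, region_tokens.shape)
```

In this sketch the region tokens would be inserted into the LLM input sequence at the positions of the corresponding placeholders, which mirrors, at a high level, how the abstract describes integrating region tokens into user instructions and model responses.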