Pixel Aligned Language Models
December 14, 2023
Authors: Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
cs.AI
Abstract
Large language models have achieved great success in recent years, as have
their variants in vision. Existing vision-language models can describe images
in natural language, answer vision-related questions, or perform complex
reasoning about an image. However, it remains unclear how localization tasks,
such as word grounding or referring localization, can be performed using large
language models. In this work, we aim to develop a vision-language model that
can take locations, for example, a set of points or boxes, as either inputs or
outputs. When taking locations as inputs, the model performs
location-conditioned captioning, which generates captions for the indicated
object or region. When generating locations as outputs, our model regresses
pixel coordinates for each output word generated by the language model, and
thus performs dense word grounding. Our model is pre-trained on the Localized
Narrative dataset, which contains pixel-word-aligned captions derived from
human attention. We show that our model can be applied to various location-aware
vision-language tasks, including referring localization, location-conditioned
captioning, and dense object captioning, achieving state-of-the-art performance
on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM.
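To make the dense word-grounding idea concrete, below is a minimal sketch of how a lightweight regression head could map each output-token feature from a language model to a 2-D pixel coordinate, one point per generated word. This is an illustrative assumption written in PyTorch, not the authors' implementation; the class name PixelRegressionHead, the hidden size, and the MLP depth are all hypothetical.

```python
# Minimal sketch (not the authors' code) of per-word pixel regression:
# a small MLP head maps each output-token hidden state from the language
# model to a normalized (x, y) pixel coordinate.
import torch
import torch.nn as nn


class PixelRegressionHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Two-layer MLP regressing normalized (x, y) in [0, 1] per token.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),
            nn.Sigmoid(),
        )

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (batch, seq_len, hidden_dim) hidden states of the
        # words generated by the language model.
        # Returns (batch, seq_len, 2): one pixel coordinate per output word.
        return self.mlp(token_features)


# Usage: given hidden states from a captioning LM, predict one point per word.
features = torch.randn(1, 12, 768)        # e.g. 12 generated words
points = PixelRegressionHead()(features)  # (1, 12, 2) normalized coordinates
```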