Pixel Aligned Language Models
December 14, 2023
Authors: Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
cs.AI
Abstract
Large language models have achieved great success in recent years, as have
their variants in vision. Existing vision-language models can describe images
in natural language, answer vision-related questions, or perform complex
reasoning about an image. However, it remains unclear how localization tasks,
such as word grounding or referring localization, can be performed using large
language models. In this work, we aim to develop a vision-language model that
can take locations, for example, a set of points or boxes, as either inputs or
outputs. When taking locations as inputs, the model performs
location-conditioned captioning, which generates captions for the indicated
object or region. When generating locations as outputs, our model regresses
pixel coordinates for each output word generated by the language model, and
thus performs dense word grounding. Our model is pre-trained on the Localized
Narrative dataset, which contains pixel-word-aligned captions derived from
human attention. We show that our model can be applied to various location-aware
vision-language tasks, including referring localization, location-conditioned
captioning, and dense object captioning, achieving state-of-the-art performance
on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM.
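To make the dense word-grounding idea concrete, below is a minimal sketch of how a lightweight regression head could map each output-token feature from a language model to a 2-D pixel coordinate, one point per generated word. This is an illustrative assumption written in PyTorch, not the authors' implementation; the class name PixelRegressionHead, the hidden size, and the MLP depth are all hypothetical.

```python
# Minimal sketch (not the authors' code) of per-word pixel regression:
# a small MLP head maps each output-token hidden state from the language
# model to a normalized (x, y) pixel coordinate.
import torch
import torch.nn as nn


class PixelRegressionHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Two-layer MLP regressing normalized (x, y) in [0, 1] per token.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),
            nn.Sigmoid(),
        )

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (batch, seq_len, hidden_dim) hidden states of the
        # words generated by the language model.
        # Returns (batch, seq_len, 2): one pixel coordinate per output word.
        return self.mlp(token_features)


# Usage: given hidden states from a captioning LM, predict one point per word.
features = torch.randn(1, 12, 768)        # e.g. 12 generated words
points = PixelRegressionHead()(features)  # (1, 12, 2) normalized coordinates
```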