

Pixel Aligned Language Models

December 14, 2023
作者: Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
cs.AI

Abstract

Large language models have achieved great success in recent years, and so have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it is yet unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example, a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captioning derived from human attention. We show our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM
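The dense word grounding described above attaches a 2D pixel location to every token the language model emits. A minimal sketch of this idea is a per-token regression head applied to the language model's hidden states; the head below is a hypothetical single linear layer with a sigmoid (the paper does not specify this exact architecture, and `W`, `b`, and the toy dimensions are illustrative only):

```python
import numpy as np

def pixel_regression_head(hidden_states, W, b):
    """Map each token's hidden state to an (x, y) pixel coordinate.

    hidden_states: (num_tokens, d) array of LM output features.
    W: (d, 2) weights, b: (2,) bias -- a hypothetical linear head;
    PixelLLM's actual head may differ.
    Returns (num_tokens, 2) coordinates in (0, 1), to be scaled
    by the image width and height.
    """
    coords = hidden_states @ W + b
    # Sigmoid keeps every predicted point inside the image plane.
    return 1.0 / (1.0 + np.exp(-coords))

# Toy example: 5 generated tokens with 16-dim hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))
W = rng.normal(scale=0.1, size=(16, 2))
b = np.zeros(2)
points = pixel_regression_head(h, W, b)
print(points.shape)  # one (x, y) point per output word
```

Because the head is applied to every token, a single decoding pass yields both the caption and a pixel trajectory aligned with it, which is what enables training on the pixel-word-aligned captions of Localized Narratives.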