Pixel Aligned Language Models

Abstract

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, the model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captions derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM
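
The abstract states that the model regresses pixel coordinates for each output word but does not say how. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each generated word has a feature vector and that a single linear layer followed by a sigmoid maps it to a normalized (x, y) location. The function name `regress_word_locations`, the feature size, and the random untrained weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def regress_word_locations(word_features, weights, bias, image_size):
    """Map one feature vector per word to one (x, y) pixel location per word.

    word_features: list of T feature vectors, each of length D
    weights:       2 rows of length D (one row for x, one for y)
    bias:          2 offsets
    image_size:    (width, height) in pixels
    """
    width, height = image_size
    locations = []
    for feature in word_features:
        # Linear projection to two logits, one per coordinate axis.
        logits = [
            sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)
        ]
        # Sigmoid squashes each logit into (0, 1); then scale to pixel units.
        x_norm, y_norm = (1.0 / (1.0 + math.exp(-z)) for z in logits)
        locations.append((x_norm * width, y_norm * height))
    return locations


# Hypothetical inputs: random features stand in for language model outputs.
random.seed(0)
words = ["a", "dog", "on", "grass"]
feature_dim = 8
word_features = [
    [random.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in words
]
weights = [[random.gauss(0.0, 0.1) for _ in range(feature_dim)] for _ in range(2)]
bias = [0.0, 0.0]

locations = regress_word_locations(
    word_features, weights, bias, image_size=(640, 480)
)
for word, (x, y) in zip(words, locations):
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

With untrained weights the printed locations are meaningless. The sketch only shows the shape of the computation: one coordinate pair per word, bounded by the image size. In a trained model the weights would presumably be fit against pixel-word-aligned annotations such as those in the Localized Narrative dataset.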
