Pixel Aligned Language Models

Abstract

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it remains unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating a caption for the indicated object or region. When generating locations as outputs, the model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captions derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM
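To make the idea of dense word grounding concrete, the following is a minimal sketch of regressing one pixel coordinate per output word. It is an illustration under stated assumptions, not the model described here: the linear head, the sigmoid normalization, the random weights, and the names `make_linear_head` and `ground_words` are all hypothetical, and the per-word feature vectors stand in for whatever hidden states a language model would produce.

```python
import math
import random
from typing import List, Sequence, Tuple

Head = Tuple[List[List[float]], List[float]]


def make_linear_head(feature_dim: int, seed: int = 0) -> Head:
    """Build a hypothetical 2-output linear head with random weights (untrained)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(feature_dim)] for _ in range(2)]
    bias = [0.0, 0.0]
    return weights, bias


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ground_words(
    words: Sequence[str],
    features: Sequence[Sequence[float]],
    head: Head,
    image_size: Tuple[int, int],
) -> List[Tuple[str, Tuple[float, float]]]:
    """Map each word's feature vector to one (x, y) pixel coordinate.

    Assumption: coordinates are predicted in [0, 1] via a sigmoid and then
    scaled by the image width and height.
    """
    weights, bias = head
    width, height = image_size
    grounded = []
    for word, feat in zip(words, features):
        x_norm = _sigmoid(sum(w * f for w, f in zip(weights[0], feat)) + bias[0])
        y_norm = _sigmoid(sum(w * f for w, f in zip(weights[1], feat)) + bias[1])
        grounded.append((word, (x_norm * width, y_norm * height)))
    return grounded


# Toy usage: three words, each with a made-up 4-dimensional feature vector.
words = ["a", "dog", "runs"]
features = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.3, 0.9, -0.7]]
result = ground_words(words, features, make_linear_head(4), image_size=(640, 480))
for word, (x, y) in result:
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

The point of the sketch is only the output shape: every generated word is paired with its own pixel location, which is what distinguishes dense word grounding from predicting a single box for a whole caption.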
