GLaMM: Pixel Grounding Large Multimodal Model

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to a single object category at a time, require users to specify the regions in the input, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is also flexible enough to accept both textual and optional visual prompts (regions of interest) as input. This empowers users to interact with the model at various levels of granularity, in both the textual and visual domains. Due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD), built with our proposed automated annotation pipeline, that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.

