Kosmos-2: Grounding Multimodal Large Language Models to the World

Abstract

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) that adds two capabilities: perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent referring expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct a large-scale dataset of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays the foundation for the development of Embodied AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Data, demo, and pretrained models are available at https://aka.ms/kosmos-2.
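
The Markdown-style link representation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how a bounding box becomes location tokens, so everything concrete here is an assumption made for illustration: the 32 x 32 grid, the row-major cell numbering, the hypothetical token spelling `<loc_k>`, and the helper names `box_to_location_tokens` and `ground`. It should not be read as the model's actual tokenizer.

```python
from typing import Tuple

# Assumed number of bins per image side; the abstract does not state this value.
GRID = 32

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1) in [0, 1]


def box_to_location_tokens(box: Box, grid: int = GRID) -> str:
    """Turn a normalized box into two hypothetical location tokens.

    The top-left and bottom-right corners are each snapped to a cell of a
    grid x grid layout and named by that cell's row-major index.
    """
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 <= x1 <= 1.0 and 0.0 <= y0 <= y1 <= 1.0):
        raise ValueError("box must be normalized with x0 <= x1 and y0 <= y1")

    def cell(x: float, y: float) -> int:
        # Clamp so that a coordinate of exactly 1.0 falls in the last bin.
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col

    return f"<loc_{cell(x0, y0)}><loc_{cell(x1, y1)}>"


def ground(text_span: str, box: Box, grid: int = GRID) -> str:
    """Format a grounded span as a Markdown-style link: [text span](tokens)."""
    return f"[{text_span}]({box_to_location_tokens(box, grid)})"


if __name__ == "__main__":
    caption = (
        f"{ground('a snowman', (0.25, 0.5, 0.75, 1.0))} warming himself by "
        f"{ground('a fire', (0.0, 0.0, 0.5, 0.5))}"
    )
    print(caption)
```

In this sketch each grounded phrase stays inline with the surrounding caption, so an ordinary text sequence carries both the words and their discretized locations.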
