Kosmos-2.5: A Multimodal Literate Model

Abstract

We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted to any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
