Kosmos-2.5: A Multimodal Literate Model
September 20, 2023
Authors: Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
cs.AI
Abstract
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
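
To make the prompt-driven, dual-output design concrete, below is a minimal illustrative sketch in Python. It is not the authors' released code: the task tokens ("<ocr>", "<md>"), the `transcribe` function, and the canned sample outputs are hypothetical stand-ins that only mirror what the abstract describes, one shared model that emits either coordinate-tagged text blocks or markdown depending on the task-specific prompt.

```python
# Illustrative sketch only -- not the actual Kosmos-2.5 interface.
# The task tokens "<ocr>"/"<md>", the function name, and the sample
# outputs are hypothetical; they merely mirror the two transcription
# tasks described in the abstract.

from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextBlock:
    """One spatially-aware text block: the transcribed text plus its
    bounding-box coordinates (x0, y0, x1, y1) within the input image."""
    text: str
    bbox: Tuple[int, int, int, int]


def transcribe(image_path: str, task_prompt: str) -> Union[List[TextBlock], str]:
    """Single entry point for both tasks: the same (hypothetical) model
    backbone is steered purely by the task prompt, echoing the paper's
    shared-Transformer / task-specific-prompt design."""
    if task_prompt == "<ocr>":
        # Task 1: spatially-aware text blocks, each assigned coordinates.
        return [
            TextBlock("Kosmos-2.5: A Multimodal Literate Model", (102, 88, 910, 132)),
            TextBlock("Abstract", (102, 180, 260, 214)),
        ]
    if task_prompt == "<md>":
        # Task 2: structured markdown that preserves style and structure.
        return "# Kosmos-2.5: A Multimodal Literate Model\n\n## Abstract\n"
    raise ValueError(f"unknown task prompt: {task_prompt!r}")


if __name__ == "__main__":
    for block in transcribe("page.png", "<ocr>"):
        print(block.bbox, block.text)
    print(transcribe("page.png", "<md>"))
```

Running the sketch prints dummy coordinate-tagged lines for the OCR-style task and a markdown string for the structure-preserving task; a real checkpoint would replace the canned returns with decoder generations, but the I/O contract, one image in, one of two flexible text representations out, is the point being illustrated.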