Kosmos-2.5: A Multimodal Literate Model
September 20, 2023
Authors: Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
cs.AI
Abstract
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
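To make the first transcription task concrete, the sketch below parses a spatially-aware text-block output of the kind the abstract describes, where each block of text carries its bounding-box coordinates. The `<bbox>` tag and the comma-separated coordinate serialization are illustrative assumptions for this example, not the paper's verbatim output format or tokenization.

```python
import re

# Hypothetical serialization of spatially-aware output: each line pairs
# a bounding box with the transcribed text. The exact tags and layout
# here are illustrative assumptions, not the model's actual format.
SAMPLE_OUTPUT = """\
<bbox>10,12,210,40</bbox> Quarterly Report
<bbox>10,60,310,88</bbox> Revenue grew 12% year over year.
"""

BLOCK_RE = re.compile(r"<bbox>(\d+),(\d+),(\d+),(\d+)</bbox>\s*(.*)")

def parse_blocks(text):
    """Turn serialized output into (x1, y1, x2, y2, text) tuples."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m:
            x1, y1, x2, y2 = (int(g) for g in m.groups()[:4])
            blocks.append((x1, y1, x2, y2, m.group(5)))
    return blocks

blocks = parse_blocks(SAMPLE_OUTPUT)
print(blocks[0])  # (10, 12, 210, 40, 'Quarterly Report')
```

The second task needs no such parsing step: its output is already structured markdown, which downstream tools can render directly.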