Kosmos-2.5: A Multimodal Literate Model
September 20, 2023
Authors: Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
cs.AI
Abstract
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
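To make the first transcription task concrete, the sketch below parses a spatially-aware text-block output of the kind the abstract describes, where each block of text carries its bounding-box coordinates. The `<bbox>` tag and the comma-separated coordinate serialization are illustrative assumptions for this example, not the paper's verbatim output format or tokenization.

```python
import re

# Hypothetical serialization of spatially-aware output: each line pairs
# a bounding box with the transcribed text. The exact tags and layout
# here are illustrative assumptions, not the model's actual format.
SAMPLE_OUTPUT = """\
<bbox>10,12,210,40</bbox> Quarterly Report
<bbox>10,60,310,88</bbox> Revenue grew 12% year over year.
"""

BLOCK_RE = re.compile(r"<bbox>(\d+),(\d+),(\d+),(\d+)</bbox>\s*(.*)")

def parse_blocks(text):
    """Turn serialized output into (x1, y1, x2, y2, text) tuples."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m:
            x1, y1, x2, y2 = (int(g) for g in m.groups()[:4])
            blocks.append((x1, y1, x2, y2, m.group(5)))
    return blocks

blocks = parse_blocks(SAMPLE_OUTPUT)
print(blocks[0])  # (10, 12, 210, 40, 'Quarterly Report')
```

The second task needs no such parsing step: its output is already structured markdown, which downstream tools can render directly.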