Kosmos-2.5: A Multimodal Literate Model
September 20, 2023
Authors: Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
cs.AI
Abstract
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
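
To make the prompt-driven, dual-output design concrete, below is a minimal illustrative sketch in Python. It is not the authors' released code: the task tokens ("<ocr>", "<md>"), the `transcribe` function, and the canned sample outputs are hypothetical stand-ins that only mirror what the abstract describes, one shared model that emits either coordinate-tagged text blocks or markdown depending on the task-specific prompt.

```python
# Illustrative sketch only -- not the actual Kosmos-2.5 interface.
# The task tokens "<ocr>"/"<md>", the function name, and the sample
# outputs are hypothetical; they merely mirror the two transcription
# tasks described in the abstract.

from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextBlock:
    """One spatially-aware text block: the transcribed text plus its
    bounding-box coordinates (x0, y0, x1, y1) within the input image."""
    text: str
    bbox: Tuple[int, int, int, int]


def transcribe(image_path: str, task_prompt: str) -> Union[List[TextBlock], str]:
    """Single entry point for both tasks: the same (hypothetical) model
    backbone is steered purely by the task prompt, echoing the paper's
    shared-Transformer / task-specific-prompt design."""
    if task_prompt == "<ocr>":
        # Task 1: spatially-aware text blocks, each assigned coordinates.
        return [
            TextBlock("Kosmos-2.5: A Multimodal Literate Model", (102, 88, 910, 132)),
            TextBlock("Abstract", (102, 180, 260, 214)),
        ]
    if task_prompt == "<md>":
        # Task 2: structured markdown that preserves style and structure.
        return "# Kosmos-2.5: A Multimodal Literate Model\n\n## Abstract\n"
    raise ValueError(f"unknown task prompt: {task_prompt!r}")


if __name__ == "__main__":
    for block in transcribe("page.png", "<ocr>"):
        print(block.bbox, block.text)
    print(transcribe("page.png", "<md>"))
```

Running the sketch prints dummy coordinate-tagged lines for the OCR-style task and a markdown string for the structure-preserving task; a real checkpoint would replace the canned returns with decoder generations, but the I/O contract, one image in, one of two flexible text representations out, is the point being illustrated.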