A Token-level Text Image Foundation Model for Document Understanding
March 4, 2025
Authors: Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
cs.AI
Abstract
In recent years, general visual foundation models (VFMs) have witnessed
increasing adoption, particularly as image encoders for popular multi-modal
large language models (MLLMs). However, without semantically fine-grained
supervision, these models still make fundamental prediction errors on
downstream text-image tasks, i.e., perceiving, understanding, and reasoning
over images that contain small, dense text. To bridge this gap,
we develop TokenOCR, the first token-level visual foundation model specifically
tailored for text-image-related tasks, designed to support a variety of
traditional downstream applications. To facilitate the pretraining of TokenOCR,
we also devise a high-quality data production pipeline that constructs the
first token-level image-text dataset, TokenIT, comprising 20 million images and
1.8 billion token-mask pairs. Furthermore, leveraging this foundation with
exceptional image-as-text capability, we seamlessly replace previous VFMs with
TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document
understanding tasks. Finally, extensive experiments demonstrate the
effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be
available at https://token-family.github.io/TokenOCR_project.
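To make the dataset description concrete: the abstract characterizes TokenIT as images paired with token-mask annotations. The sketch below shows one plausible record layout; the class and field names (`TokenMaskPair`, `TokenITRecord`, `mask` as a list of pixel coordinates) are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TokenMaskPair:
    # One language token transcribed from the image, e.g. a word or subword.
    token: str
    # Hypothetical mask encoding: pixel (x, y) coordinates covered by the token.
    mask: List[Tuple[int, int]] = field(default_factory=list)

@dataclass
class TokenITRecord:
    # A single text image with all of its token-mask annotations.
    image_path: str
    pairs: List[TokenMaskPair] = field(default_factory=list)

# Example record: one image annotated with two token-mask pairs.
record = TokenITRecord(
    image_path="doc_000001.png",
    pairs=[
        TokenMaskPair(token="Invoice", mask=[(10, 12), (11, 12), (12, 12)]),
        TokenMaskPair(token="Total", mask=[(40, 30), (41, 30)]),
    ],
)
print(len(record.pairs))  # number of token-mask pairs in this image
```

At the reported scale (20M images, 1.8B pairs) this works out to roughly 90 token-mask pairs per image on average, consistent with documents containing small, dense text.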