A Token-level Text Image Foundation Model for Document Understanding

Abstract

In recent years, general visual foundation models (VFMs) have seen increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still make fundamental prediction errors on downstream text-image-related tasks, i.e., perception, understanding, and reasoning over images containing small, dense text. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://token-family.github.io/TokenOCR_project.