Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents

Abstract

Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.