Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents

February 6, 2025
Authors: Ilia Karmanov, Amala Sanjay Deshmukh, Lukas Voegtle, Philipp Fischer, Kateryna Chumachenko, Timo Roman, Jarno Seppänen, Jupinder Parmar, Joseph Jennings, Andrew Tao, Karan Sapra
cs.AI

Abstract

Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure (including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages) as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.
