Nougat: Neural Optical Understanding for Academic Documents
August 25, 2023
Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
cs.AI
Abstract
Scientific knowledge is predominantly stored in books and scientific
journals, often in the form of PDFs. However, the PDF format leads to a loss of
semantic information, particularly for mathematical expressions. We propose
Nougat (Neural Optical Understanding for Academic Documents), a Visual
Transformer model that performs an Optical Character Recognition (OCR) task for
processing scientific documents into a markup language, and demonstrate the
effectiveness of our model on a new dataset of scientific documents. The
proposed approach offers a promising solution to enhance the accessibility of
scientific knowledge in the digital age, by bridging the gap between
human-readable documents and machine-readable text. We release the models and
code to accelerate future work on scientific text recognition.