Nougat: Neural Optical Understanding for Academic Documents

Abstract

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and we demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising way to make scientific knowledge more accessible in the digital age by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.
