Nougat: Neural Optical Understanding for Academic Documents
August 25, 2023
Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
cs.AI
Abstract
Scientific knowledge is predominantly stored in books and scientific
journals, often in the form of PDFs. However, the PDF format leads to a loss of
semantic information, particularly for mathematical expressions. We propose
Nougat (Neural Optical Understanding for Academic Documents), a Visual
Transformer model that performs an Optical Character Recognition (OCR) task for
processing scientific documents into a markup language, and demonstrate the
effectiveness of our model on a new dataset of scientific documents. The
proposed approach offers a promising solution to enhance the accessibility of
scientific knowledge in the digital age, by bridging the gap between
human-readable documents and machine-readable text. We release the models and
code to accelerate future work on scientific text recognition.