Weakly supervised information extraction from inscrutable handwritten document images
June 12, 2023
Authors: Sujoy Paul, Gagan Madan, Akankshya Mishra, Narayan Hegde, Pradeep Kumar, Gaurav Aggarwal
cs.AI
Abstract
State-of-the-art information extraction methods are limited by OCR errors.
They work well for printed text in form-like documents, but unstructured,
handwritten documents remain a challenge. Adapting existing models to
domain-specific training data is expensive for two reasons: 1) the
limited availability of domain-specific documents (such as handwritten
prescriptions, lab notes, etc.), and 2) the added difficulty of annotation,
since one needs domain-specific knowledge to decode inscrutable
handwritten document images. In this work, we focus on the complex problem of
extracting medicine names from handwritten prescriptions using only weakly
labeled data. The data consists of images, each paired with the list of
medicine names it contains, but not their locations in the image. We solve the
problem by first identifying the regions of interest, i.e., medicine lines, from
just weak labels
and then injecting a domain-specific medicine language model learned using only
synthetically generated data. Compared to off-the-shelf state-of-the-art
methods, our approach performs more than 2.5x better at extracting medicine
names from prescriptions.
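
To make the weak-label setting concrete, the following is a minimal sketch, not the authors' method. It assumes a hypothetical upstream OCR step that has already produced one noisy text string per handwritten line, and it uses plain string similarity from Python's standard library to guess which lines are medicine lines, given only the image-level list of medicine names. The function names, the threshold, and the sample prescription are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pseudo_label_lines(ocr_lines, medicine_names, threshold=0.6):
    """Map line index -> medicine name using only image-level weak labels.

    For each weak label, pick the OCR line containing the most similar
    token, and keep the match only if it clears the (assumed) threshold.
    """
    labels = {}
    for name in medicine_names:
        best_idx, best_score = None, 0.0
        for idx, line in enumerate(ocr_lines):
            # Score a line by its single best-matching token.
            score = max((similarity(name, tok) for tok in line.split()),
                        default=0.0)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            labels[best_idx] = name
    return labels


# Hypothetical noisy OCR output for one prescription image.
ocr_lines = [
    "Dr. A. Kumar",
    "Paracetmol 500mg 1-0-1",
    "Amoxcilin 250 mg",
    "Review after 5 days",
]
# Weak label: the medicine names present in the image, without locations.
weak_label = ["Paracetamol", "Amoxicillin"]

print(pseudo_label_lines(ocr_lines, weak_label))
```

The design point this illustrates is that the supervision never says where a medicine name appears; any line-level labels have to be inferred by matching recognized text against the image-level list.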