

Weakly supervised information extraction from inscrutable handwritten document images

June 12, 2023
Authors: Sujoy Paul, Gagan Madan, Akankshya Mishra, Narayan Hegde, Pradeep Kumar, Gaurav Aggarwal
cs.AI

Abstract

State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured, handwritten documents remain a challenge. Adapting existing models to domain-specific training data is quite expensive, because of two factors: 1) limited availability of domain-specific documents (such as handwritten prescriptions, lab notes, etc.), and 2) annotation becomes even more challenging, as one needs domain-specific knowledge to decode inscrutable handwritten document images. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consists of images along with the list of medicine names they contain, but not their locations in the image. We solve the problem by first identifying the regions of interest, i.e., medicine lines, from just the weak labels, and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs >2.5x better at medicine name extraction from prescriptions.
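The second stage described above, a domain-specific language model trained only on synthetic data and used to rescore noisy OCR output, can be illustrated with a minimal sketch. This is not the authors' actual model: the character-bigram scorer, the `lexicon` of medicine names, and the OCR hypotheses below are all simplified assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_char_bigram(names):
    """Train a character-bigram model on a synthetic medicine-name lexicon."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = "^" + name.lower() + "$"  # boundary markers
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    # Normalize counts into conditional probabilities P(b | a).
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def score(model, text, floor=1e-6):
    """Average log-probability of text under the model; unseen pairs get a floor."""
    padded = "^" + text.lower() + "$"
    logp = sum(math.log(model.get(a, {}).get(b, floor))
               for a, b in zip(padded, padded[1:]))
    return logp / (len(padded) - 1)

# Hypothetical synthetic lexicon standing in for the generated training data.
lexicon = ["paracetamol", "amoxicillin", "metformin", "ibuprofen"]
model = train_char_bigram(lexicon)

# Rescore noisy OCR hypotheses for one detected medicine line:
# the in-domain spelling scores higher than the garbled one.
hypotheses = ["parxcetqmol", "paracetamol"]
best = max(hypotheses, key=lambda h: score(model, h))
```

In the paper's setting the language model would be far richer, but the principle is the same: the OCR engine proposes readings for each medicine line, and the domain-specific model, having seen only synthetic medicine names, prefers candidates that look like real drug names.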