Weakly supervised information extraction from inscrutable handwritten document images

Abstract

State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured handwritten documents remain a challenge. Adapting existing models to domain-specific training data is expensive for two reasons: 1) domain-specific documents (such as handwritten prescriptions and lab notes) are in limited supply, and 2) annotation is especially difficult because decoding inscrutable handwritten document images requires domain-specific knowledge. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. Each sample consists of an image together with the list of medicine names it contains, but not the locations of those names in the image. We solve the problem by first identifying the regions of interest, i.e., the medicine lines, from the weak labels alone, and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs more than 2.5x better at extracting medicine names from prescriptions.

