Weakly supervised information extraction from inscrutable handwritten document images

Abstract

State-of-the-art information extraction methods are limited by OCR errors. They work well on printed text in form-like documents, but unstructured handwritten documents remain a challenge. Adapting existing models to domain-specific training data is expensive for two reasons: (1) domain-specific documents (such as handwritten prescriptions and lab notes) are scarce, and (2) annotation is harder still, because decoding inscrutable handwritten document images requires domain-specific knowledge. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consists of images, each paired with the list of medicine names it contains but not their locations in the image. We solve the problem by first identifying the regions of interest, i.e., the medicine lines, from the weak labels alone, and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs more than 2.5x better at extracting medicine names from prescriptions.
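
To make the weak-label setting concrete, the following is a minimal sketch, not the method of this work. It assumes that noisy OCR text lines are already available and uses plain string similarity to pair each line with its closest image-level label, which gives line-level pseudo-labels without any location annotation. The names `WeakSample` and `pseudo_label_lines`, the file path, and the threshold of 0.6 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class WeakSample:
    """One weakly labeled prescription: an image plus the names it contains."""
    image_path: str            # hypothetical path to the scanned prescription
    medicine_names: list[str]  # image-level labels, with no location information


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pseudo_label_lines(ocr_lines: list[str], sample: WeakSample,
                       threshold: float = 0.6) -> list[tuple[int, str]]:
    """Pair each noisy OCR line with its closest weak label, if close enough.

    Returns (line_index, medicine_name) pairs that could serve as line-level
    pseudo-labels. The threshold is an arbitrary illustrative value.
    """
    pairs = []
    for idx, line in enumerate(ocr_lines):
        # Pick the label most similar to this line (None if there are no labels).
        best = max(sample.medicine_names,
                   key=lambda name: similarity(line, name),
                   default=None)
        if best is not None and similarity(line, best) >= threshold:
            pairs.append((idx, best))
    return pairs


# Usage with made-up OCR output containing typical recognition errors.
sample = WeakSample("rx_001.png", ["Paracetamol", "Amoxicillin"])
ocr_lines = ["Dr. Smith Clinic", "Paracetmol 500mg",
             "Amoxicilin 250", "Take after food"]
print(pseudo_label_lines(ocr_lines, sample))
```

Lines that resemble none of the labels, such as the clinic header, fall below the threshold and are left unlabeled, which is the sense in which image-level labels alone can point to the medicine lines.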
