CommonForms: A Large, Diverse Dataset for Form Field Detection

Abstract

This paper introduces CommonForms, a web-scale dataset for form field detection. It casts form field detection as an object detection problem: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of each form field. The dataset is constructed by filtering Common Crawl for PDFs that have fillable elements. Starting from 8 million documents, the filtering process yields a final dataset of roughly 55k documents comprising over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains: one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs in Common Crawl that have fillable fields. A qualitative analysis shows that the FFDNet models outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, and these are the first open-source models for the task. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms
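
To make the object-detection framing concrete, the following is a minimal sketch of how a predicted form field might be represented and compared against a ground-truth field. It is not taken from the CommonForms code. The names `FieldType`, `FormField`, `iou`, and `is_match`, the normalized `(x0, y0, x1, y1)` box convention, and the 0.5 overlap threshold are all assumptions made for illustration. Only the three field types come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The three classes named in the abstract.
    TEXT_INPUT = "text_input"
    CHOICE_BUTTON = "choice_button"
    SIGNATURE = "signature"


@dataclass(frozen=True)
class FormField:
    # Assumed convention (hypothetical): (x0, y0, x1, y1) in page-normalized
    # coordinates, with x0 < x1 and y0 < y1.
    box: tuple[float, float, float, float]
    field_type: FieldType
    score: float = 1.0  # detector confidence; 1.0 for ground-truth fields


def iou(a: FormField, b: FormField) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a.box
    bx0, by0, bx1, by1 = b.box
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: FormField, truth: FormField, threshold: float = 0.5) -> bool:
    """A prediction counts as correct if its type agrees with the ground
    truth and the boxes overlap enough. The threshold is an assumption."""
    return pred.field_type is truth.field_type and iou(pred, truth) >= threshold
```

Average precision, the metric reported in the abstract, is typically built on this kind of per-box matching: predictions are ranked by `score`, each is marked as a hit or a miss, and precision is summarized across recall levels. The exact evaluation protocol used for FFDNet is not stated in the abstract.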
