CommonForms: A Large, Diverse Dataset for Form Field Detection

Abstract

This paper introduces CommonForms, a web-scale dataset for form field detection. It casts form field detection as an object detection problem: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of each form field. The dataset is constructed by filtering Common Crawl for PDFs that have fillable elements. Starting from 8 million documents, the filtering process yields a final dataset of roughly 55k documents comprising over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains: one third of the pages are non-English, and among the 14 classified domains, no single domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency compared with using all PDFs in Common Crawl that have fillable fields. A qualitative analysis shows that the FFDNet models outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. To our knowledge, this is the first large-scale dataset released for form field detection, and these are the first open-source models for the task. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms.
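To make the object detection framing concrete, the following is a minimal sketch of how a form field annotation and a prediction might be represented and compared. It is not taken from the CommonForms code: the names `FieldType`, `FormField`, `iou`, and `is_match` are hypothetical, the pixel-coordinate box convention is an assumption, and the 0.5 overlap threshold is a common object detection default rather than a value stated in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The three field classes named in the abstract.
    TEXT_INPUT = "Text Input"
    CHOICE_BUTTON = "Choice Button"
    SIGNATURE = "Signature"


@dataclass(frozen=True)
class FormField:
    """One form field: a class label plus an axis-aligned box.

    Coordinates are assumed to be pixels on the rendered page image,
    with (x_min, y_min) the top-left corner (an illustrative convention).
    """
    field_type: FieldType
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: FormField, b: FormField) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: FormField, truth: FormField, threshold: float = 0.5) -> bool:
    """A prediction counts as correct when the class agrees and the
    overlap reaches the threshold (0.5 is an assumed default)."""
    return pred.field_type == truth.field_type and iou(pred, truth) >= threshold
```

For example, a predicted text input at `(110, 200, 300, 230)` compared against a ground-truth text input at `(100, 200, 300, 230)` overlaps on 5,700 of 6,000 union pixels, so `iou` returns 0.95 and `is_match` returns `True`. Average precision, the metric reported for FFDNet, is built on top of matches of this kind.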
