CommonForms: A Large, Diverse Dataset for Form Field Detection

September 20, 2025
Author: Joe Barrow
cs.AI

Abstract

This paper introduces CommonForms, a web-scale dataset for form field detection. It casts form field detection as an object detection problem: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of each form field. The dataset is constructed by filtering Common Crawl for PDFs that have fillable elements. Starting from 8 million documents, the filtering process yields a final dataset of roughly 55k documents comprising over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains: one third of the pages are non-English, and among the 14 classified domains, no single domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency compared with using all PDFs in Common Crawl that have fillable fields. A qualitative analysis shows that the FFDNet models outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, and these are the first open-source models for the task. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms.
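
To make the object-detection framing concrete, the following is a minimal sketch of how a detected form field and a ground-truth field might be represented and compared. It is an illustration only, not the CommonForms or FFDNet code: the names `FieldType`, `FormField`, `iou`, and `is_match` are hypothetical, the pixel-coordinate box layout is an assumption, and the 0.5 intersection-over-union threshold is a common object-detection convention rather than a value stated in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The three field types named in the abstract.
    TEXT_INPUT = "text_input"
    CHOICE_BUTTON = "choice_button"
    SIGNATURE = "signature"


@dataclass(frozen=True)
class FormField:
    # Hypothetical layout: box corners in page-image pixels,
    # (x0, y0) is the top-left corner and (x1, y1) the bottom-right.
    x0: float
    y0: float
    x1: float
    y1: float
    field_type: FieldType
    score: float = 1.0  # detector confidence; 1.0 for ground truth


def area(f: FormField) -> float:
    return max(0.0, f.x1 - f.x0) * max(0.0, f.y1 - f.y0)


def iou(a: FormField, b: FormField) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: FormField, truth: FormField, threshold: float = 0.5) -> bool:
    # Assumed matching rule: same field type and enough box overlap.
    return pred.field_type == truth.field_type and iou(pred, truth) >= threshold
```

Average precision, the metric reported for FFDNet, is built on this kind of per-box matching: predictions are ranked by confidence, each is marked correct or incorrect against the ground truth, and precision is summarized across recall levels.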