
CommonForms: A Large, Diverse Dataset for Form Field Detection

September 20, 2025
Author: Joe Barrow
cs.AI

Abstract

This paper introduces CommonForms, a web-scale dataset for form field detection. It casts form field detection as an object detection problem: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of each form field. The dataset is constructed by filtering Common Crawl for PDFs that have fillable elements. Starting from 8 million documents, the filtering process yields a final dataset of roughly 55k documents with over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains: one third of the pages are non-English, and among the 14 classified domains, no single domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency compared with using all PDFs in Common Crawl that have fillable fields. A qualitative analysis shows that the models outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, and these are the first open-source models for the task. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms
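To make the object-detection framing concrete, the following is a minimal sketch of how a labeled form field and a predicted one might be represented and compared. It is an illustration only, not the CommonForms data format or the FFDNet code: the names `FieldType`, `FieldBox`, `iou`, and `is_match` are hypothetical, the pixel-coordinate box layout is assumed, and the 0.5 overlap threshold is a common object-detection convention rather than a value stated in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The three field classes named in the abstract.
    TEXT_INPUT = "text_input"
    CHOICE_BUTTON = "choice_button"
    SIGNATURE = "signature"


@dataclass(frozen=True)
class FieldBox:
    """One form field on a page image (hypothetical layout, pixel coordinates)."""
    field_type: FieldType
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: FieldBox, b: FieldBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: FieldBox, truth: FieldBox, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when the class agrees and the
    overlap reaches the threshold (0.5 is an assumed default)."""
    return pred.field_type == truth.field_type and iou(pred, truth) >= threshold
```

Average precision, the metric the abstract reports, is built on this kind of per-box matching: predictions are ranked by confidence and each one is scored as a hit or a miss against the labeled fields of the same type.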