
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

October 16, 2024
Authors: Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
cs.AI

Abstract

Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
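
To make the bin-packing framing concrete, here is a minimal sketch of a generic best-fit heuristic for placing rectangular elements on a page. It is not the paper's Mesh-candidate BestFit algorithm, whose details the abstract does not give. The function name `best_fit_layout`, the `Rect` type, the guillotine-style splitting of free space, and the fill-ratio scoring rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; (x, y) is the top-left corner."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h


def best_fit_layout(
    page_w: float, page_h: float, candidates: List[Tuple[float, float]]
) -> List[Tuple[int, Rect]]:
    """Greedily place candidate (w, h) elements on a page.

    Hypothetical illustration only: for each free cell, pick the unused
    candidate that fits and covers the largest share of that cell.
    Returns (candidate_index, placed_rect) pairs.
    """
    free = [Rect(0, 0, page_w, page_h)]  # free cells still available
    unused = set(range(len(candidates)))
    placed: List[Tuple[int, Rect]] = []

    while free:
        cell = free.pop()
        best, best_ratio = None, 0.0
        for i in sorted(unused):
            w, h = candidates[i]
            if w <= cell.w and h <= cell.h:
                ratio = (w * h) / cell.area  # fill ratio of this cell
                if ratio > best_ratio:
                    best, best_ratio = i, ratio
        if best is None:
            continue  # nothing fits; the cell stays as whitespace

        w, h = candidates[best]
        unused.remove(best)
        placed.append((best, Rect(cell.x, cell.y, w, h)))

        # Split the leftover space into a strip to the right of the
        # element and a full-width strip below it.
        if cell.w - w > 0:
            free.append(Rect(cell.x + w, cell.y, cell.w - w, h))
        if cell.h - h > 0:
            free.append(Rect(cell.x, cell.y + h, cell.w, cell.h - h))

    return placed


if __name__ == "__main__":
    elements = [(60, 40), (40, 40), (100, 30), (50, 50), (30, 20)]
    for idx, r in best_fit_layout(100, 100, elements):
        print(idx, r)
```

A real synthesis pipeline would also need element categories, margins, and a diversity criterion when sampling candidates; those are left out here because the abstract does not specify them.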

