DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Abstract

Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
