UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis
March 20, 2025
Authors: Jiawei Wang, Kai Hu, Qiang Huo
cs.AI
Abstract
Document structure analysis, also known as document layout analysis, is
crucial for understanding both the physical layout and logical structure of
documents, and supports tasks such as information retrieval, document
summarization, and knowledge extraction. Hierarchical Document Structure
Analysis (HDSA) specifically aims to
restore the hierarchical structure of documents created using authoring
software with hierarchical schemas. Previous research has primarily followed
two approaches: one focuses on tackling specific subtasks of HDSA in isolation,
such as table detection or reading order prediction, while the other adopts a
unified framework that uses multiple branches or modules, each designed to
address a distinct task. In this work, we propose a unified relation prediction
approach for HDSA, called UniHDSA, which treats various HDSA subtasks as
relation prediction problems and consolidates relation prediction labels into a
unified label space. This allows a single relation prediction module to handle
multiple tasks simultaneously, for both page-level and document-level
structure analysis. To validate the effectiveness of UniHDSA, we develop a
multimodal end-to-end system based on Transformer architectures. Extensive
experimental results demonstrate that our approach achieves state-of-the-art
performance on a hierarchical document structure analysis benchmark,
Comp-HRDoc, and competitive results on a large-scale document layout analysis
dataset, DocLayNet, effectively illustrating the superiority of our method
across all subtasks. The Comp-HRDoc benchmark and UniHDSA's configurations are
publicly available at https://github.com/microsoft/CompHRDoc.
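
To make the idea of a unified label space more concrete, here is a toy sketch in Python. It is not the authors' model or label set: the relation names (`INTRA_REGION`, `READING_ORDER`, `PARENT_CHILD`), the `Edge` type, and the helper functions are all hypothetical, chosen only to illustrate how the output of one shared relation predictor could be split back into per-task results.

```python
from enum import Enum
from typing import Dict, List, NamedTuple, Tuple


class Relation(Enum):
    # Hypothetical unified label space; these names are illustrative only.
    NONE = 0
    INTRA_REGION = 1   # two text lines belong to the same region (page level)
    READING_ORDER = 2  # target follows source in reading order (page level)
    PARENT_CHILD = 3   # source is the parent of target (document level)


class Edge(NamedTuple):
    source: int
    target: int
    label: Relation


def split_by_task(edges: List[Edge]) -> Dict[Relation, List[Tuple[int, int]]]:
    """Group the edges emitted by a single predictor by their relation label."""
    tasks: Dict[Relation, List[Tuple[int, int]]] = {}
    for edge in edges:
        if edge.label is Relation.NONE:
            continue  # NONE means no relation was predicted for this pair
        tasks.setdefault(edge.label, []).append((edge.source, edge.target))
    return tasks


def follow_reading_order(pairs: List[Tuple[int, int]], start: int) -> List[int]:
    """Follow successor links from `start`, stopping at a dead end or a cycle."""
    successor = dict(pairs)
    order = [start]
    seen = {start}
    while order[-1] in successor and successor[order[-1]] not in seen:
        order.append(successor[order[-1]])
        seen.add(order[-1])
    return order


# Made-up predictions over three elements: 0 = heading, 1 and 2 = paragraphs.
edges = [
    Edge(0, 1, Relation.READING_ORDER),
    Edge(1, 2, Relation.READING_ORDER),
    Edge(0, 1, Relation.PARENT_CHILD),
    Edge(0, 2, Relation.PARENT_CHILD),
    Edge(2, 0, Relation.NONE),
]

tasks = split_by_task(edges)
order = follow_reading_order(tasks[Relation.READING_ORDER], start=0)
```

In this sketch every subtask reads from the same list of labeled edges, so adding a subtask means adding a label rather than a new prediction branch. How UniHDSA itself encodes and decodes its labels is described in the paper and the linked repository, not here.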