Logics-Parsing Technical Report
September 24, 2025
Authors: Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
cs.AI
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing tasks. Compared to traditional pipeline-based methods, end-to-end paradigms have demonstrated superior performance in converting PDF images into structured outputs by integrating Optical Character Recognition (OCR), table recognition, mathematical formula recognition, and more. However, the absence of explicit analytical stages for document layouts and reading orders limits the capability of LVLMs on complex document types such as multi-column newspapers or posters. To address this limitation, we propose Logics-Parsing in this report: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading-order inference. In addition, we expand the model's versatility by incorporating diverse data types, such as chemical formulas and handwritten Chinese characters, into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments on LogicsParsingBench validate the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios.

Project Page: https://github.com/alibaba/Logics-Parsing
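
As a rough illustration of what a reading-order reward signal could look like (the abstract does not specify the report's actual reward design, so the metric, function name, and inputs below are assumptions for illustration only), the following Python sketch scores the pairwise order agreement between a predicted block sequence and the ground-truth sequence:

    # Illustrative sketch only, NOT the reward used in Logics-Parsing.
    # Assumes the model emits page blocks as an ordered list of ground-truth
    # block IDs; the reward is the fraction of block pairs whose relative
    # order matches the gold reading order (a normalized pairwise agreement).
    from itertools import combinations

    def reading_order_reward(predicted_ids: list[int], gold_ids: list[int]) -> float:
        """Fraction of predicted block pairs appearing in the gold relative order."""
        gold_rank = {block_id: i for i, block_id in enumerate(gold_ids)}
        pairs = list(combinations(predicted_ids, 2))
        if not pairs:
            return 1.0
        agree = sum(
            1 for a, b in pairs
            if a in gold_rank and b in gold_rank and gold_rank[a] < gold_rank[b]
        )
        return agree / len(pairs)

    # Example: a two-column page read partly out of order.
    print(reading_order_reward([0, 2, 1, 3], [0, 1, 2, 3]))  # ~0.83

A pairwise agreement score of this kind is dense and bounded in [0, 1], which makes it a convenient shaping term for reinforcement learning over reading order, but it is given here only as a hypothetical example of the kind of reward such a system might use.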