Logics-Parsing Technical Report
September 24, 2025
Authors: Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
cs.AI
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing tasks. Compared to traditional pipeline-based methods, end-to-end paradigms excel at converting PDF images into structured outputs by integrating Optical Character Recognition (OCR), table recognition, mathematical formula recognition, and more. However, the absence of explicit analytical stages for document layout and reading order limits an LVLM's capability to handle complex document types such as multi-column newspapers or posters. To address this limitation, this report proposes Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments on LogicsParsingBench validate the efficacy and state-of-the-art (SOTA) performance of the proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing
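
To make the abstract's mention of "reward mechanisms to optimize complex layout analysis and reading order inference" concrete, the sketch below illustrates one way a composite reward could score a predicted page structure against a reference during RL fine-tuning. It is a minimal, hypothetical illustration only: the block schema, the order/layout terms, and the weights are assumptions for exposition, not the reward design used by Logics-Parsing.

```python
# Hypothetical composite reward sketch: assumes the policy emits a list of
# layout blocks in predicted reading order, each with a category label and a
# bounding box, and that the reference provides ground-truth blocks in true
# reading order. All names and weights here are illustrative.
from difflib import SequenceMatcher


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def layout_reward(pred_blocks, gt_blocks, w_order=0.5, w_layout=0.5):
    """Blend a reading-order term (similarity of the predicted vs. reference
    category sequences) with a layout term (mean best-match IoU per
    ground-truth box)."""
    order_score = SequenceMatcher(
        None,
        [b["category"] for b in pred_blocks],
        [b["category"] for b in gt_blocks],
    ).ratio()
    if not gt_blocks:
        layout_score = 1.0 if not pred_blocks else 0.0
    else:
        layout_score = sum(
            max((box_iou(g["bbox"], p["bbox"]) for p in pred_blocks), default=0.0)
            for g in gt_blocks
        ) / len(gt_blocks)
    return w_order * order_score + w_layout * layout_score


if __name__ == "__main__":
    gt = [{"category": "title", "bbox": (0, 0, 100, 20)},
          {"category": "paragraph", "bbox": (0, 25, 100, 80)}]
    pred = [{"category": "title", "bbox": (0, 0, 98, 21)},
            {"category": "paragraph", "bbox": (0, 26, 100, 79)}]
    print(round(layout_reward(pred, gt), 3))  # close to 1.0 for a good parse
```

In an RL setup of this kind, such a scalar reward would be computed per page and fed to a policy-gradient objective; the actual reward terms used by the paper are detailed in the full report rather than this abstract.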