Logics-Parsing Technical Report
September 24, 2025
Authors: Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
cs.AI
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing tasks. Compared to traditional pipeline-based methods, end-to-end paradigms excel at converting PDF images into structured outputs by integrating Optical Character Recognition (OCR), table recognition, mathematical formula recognition, and more. However, the absence of explicit analytical stages for document layout and reading order limits an LVLM's capability to handle complex document types such as multi-column newspapers or posters. To address this limitation, this report proposes Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments on LogicsParsingBench validate the efficacy and state-of-the-art (SOTA) performance of the proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing
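
To make the abstract's mention of "reward mechanisms to optimize complex layout analysis and reading order inference" concrete, the sketch below illustrates one way a composite reward could score a predicted page structure against a reference during RL fine-tuning. It is a minimal, hypothetical illustration only: the block schema, the order/layout terms, and the weights are assumptions for exposition, not the reward design used by Logics-Parsing.

```python
# Hypothetical composite reward sketch: assumes the policy emits a list of
# layout blocks in predicted reading order, each with a category label and a
# bounding box, and that the reference provides ground-truth blocks in true
# reading order. All names and weights here are illustrative.
from difflib import SequenceMatcher


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def layout_reward(pred_blocks, gt_blocks, w_order=0.5, w_layout=0.5):
    """Blend a reading-order term (similarity of the predicted vs. reference
    category sequences) with a layout term (mean best-match IoU per
    ground-truth box)."""
    order_score = SequenceMatcher(
        None,
        [b["category"] for b in pred_blocks],
        [b["category"] for b in gt_blocks],
    ).ratio()
    if not gt_blocks:
        layout_score = 1.0 if not pred_blocks else 0.0
    else:
        layout_score = sum(
            max((box_iou(g["bbox"], p["bbox"]) for p in pred_blocks), default=0.0)
            for g in gt_blocks
        ) / len(gt_blocks)
    return w_order * order_score + w_layout * layout_score


if __name__ == "__main__":
    gt = [{"category": "title", "bbox": (0, 0, 100, 20)},
          {"category": "paragraph", "bbox": (0, 25, 100, 80)}]
    pred = [{"category": "title", "bbox": (0, 0, 98, 21)},
            {"category": "paragraph", "bbox": (0, 26, 100, 79)}]
    print(round(layout_reward(pred, gt), 3))  # close to 1.0 for a good parse
```

In an RL setup of this kind, such a scalar reward would be computed per page and fed to a policy-gradient objective; the actual reward terms used by the paper are detailed in the full report rather than this abstract.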