Logics-Parsing Technical Report
September 24, 2025
Authors: Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
cs.AI
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing tasks. Compared to traditional pipeline-based methods, end-to-end paradigms have demonstrated superior performance in converting PDF images into structured outputs by integrating Optical Character Recognition (OCR), table recognition, mathematical formula recognition, and more. However, the absence of explicit analytical stages for document layouts and reading orders limits the capability of LVLMs on complex document types such as multi-column newspapers or posters. To address this limitation, we propose Logics-Parsing in this report: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading-order inference. In addition, we expand the model's versatility by incorporating diverse data types, such as chemical formulas and handwritten Chinese characters, into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments on LogicsParsingBench validate the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios.

Project Page: https://github.com/alibaba/Logics-Parsing
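
As a rough illustration of what a reading-order reward signal could look like (the abstract does not specify the report's actual reward design, so the metric, function name, and inputs below are assumptions for illustration only), the following Python sketch scores the pairwise order agreement between a predicted block sequence and the ground-truth sequence:

    # Illustrative sketch only, NOT the reward used in Logics-Parsing.
    # Assumes the model emits page blocks as an ordered list of ground-truth
    # block IDs; the reward is the fraction of block pairs whose relative
    # order matches the gold reading order (a normalized pairwise agreement).
    from itertools import combinations

    def reading_order_reward(predicted_ids: list[int], gold_ids: list[int]) -> float:
        """Fraction of predicted block pairs appearing in the gold relative order."""
        gold_rank = {block_id: i for i, block_id in enumerate(gold_ids)}
        pairs = list(combinations(predicted_ids, 2))
        if not pairs:
            return 1.0
        agree = sum(
            1 for a, b in pairs
            if a in gold_rank and b in gold_rank and gold_rank[a] < gold_rank[b]
        )
        return agree / len(pairs)

    # Example: a two-column page read partly out of order.
    print(reading_order_reward([0, 2, 1, 3], [0, 1, 2, 3]))  # ~0.83

A pairwise agreement score of this kind is dense and bounded in [0, 1], which makes it a convenient shaping term for reinforcement learning over reading order, but it is given here only as a hypothetical example of the kind of reward such a system might use.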