Logics-Parsing Technical Report

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing. Compared with traditional pipeline-based methods, end-to-end paradigms excel at converting PDF images into structured outputs by integrating Optical Character Recognition (OCR), table recognition, mathematical formula recognition, and related tasks. However, the absence of explicit analytical stages for document layout and reading order limits the ability of LVLMs to handle complex document types such as multi-column newspapers or posters. To address this limitation, in this report we propose Logics-Parsing, an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates carefully designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types, such as chemical formulas and handwritten Chinese characters, into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments on LogicsParsingBench validate the efficacy and state-of-the-art (SOTA) performance of the proposed model across diverse document analysis scenarios. Project page: https://github.com/alibaba/Logics-Parsing
