MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Abstract

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
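
The coarse-to-fine flow described above can be illustrated with a minimal Python sketch. It is not the MinerU2.5 implementation or API. The names `Region`, `downsample`, `crop_native`, and `parse_document`, the normalized box format, the nested-list image type, and the downsampling factor of 4 are all assumptions made for illustration. The two callables stand in for the layout and recognition stages, which the abstract attributes to one model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a grayscale image as nested lists, and a
# normalized box (x0, y0, x1, y1) with every coordinate in [0, 1].
Image = List[List[int]]
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    kind: str  # e.g. "text", "formula", "table" (assumed labels)
    box: Box   # normalized, so it is valid at any resolution


def downsample(image: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel along both axes (nearest neighbour)."""
    return [row[::factor] for row in image[::factor]]


def crop_native(image: Image, box: Box) -> Image:
    """Cut a normalized box out of the original, full-resolution image."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    left, top = math.floor(x0 * width), math.floor(y0 * height)
    right = min(width, math.ceil(x1 * width))
    bottom = min(height, math.ceil(y1 * height))
    return [row[left:right] for row in image[top:bottom]]


def parse_document(
    image: Image,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[str, Image], str],
    factor: int = 4,
) -> List[Tuple[str, str]]:
    # Stage 1: layout analysis sees only the cheap, downsampled page.
    regions = detect_layout(downsample(image, factor))
    # Stage 2: recognition sees native-resolution crops, one per region.
    return [(r.kind, recognize(r.kind, crop_native(image, r.box))) for r in regions]


# Stub "models" so the sketch runs without any trained weights.
def stub_layout(thumbnail: Image) -> List[Region]:
    return [
        Region("text", (0.0, 0.0, 1.0, 0.5)),
        Region("table", (0.0, 0.5, 0.5, 1.0)),
    ]


def stub_recognizer(kind: str, crop: Image) -> str:
    return f"{kind}:{len(crop[0])}x{len(crop)}"


page = [[0] * 64 for _ in range(32)]  # 64 pixels wide, 32 pixels tall
result = parse_document(page, stub_layout, stub_recognizer)
```

Because the boxes are normalized, a region found on the thumbnail maps directly onto the original page, so the recognizer never receives the whole high-resolution image. Only the crops it needs are passed at native resolution.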
