MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Abstract

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy. Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while correcting distribution shift. Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations. The Judge-and-Refine pipeline improves annotation quality for hard samples through iterative render-then-verify correction. A three-stage progressive training strategy (large-scale pre-training, hard-sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200× more parameters.
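The abstract does not specify how Cross-Model Consistency Verification measures agreement, so the following is only a minimal sketch of the general idea: several heterogeneous parsers process the same page, high agreement marks the sample as easy and yields a consensus annotation, and low agreement marks it as hard. The function names (`pairwise_agreement`, `triage`), the use of `difflib.SequenceMatcher` as the similarity measure, and the `easy_threshold` value of 0.9 are all illustrative assumptions, not details taken from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real parsing metric."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_agreement(outputs: list[str]) -> float:
    """Mean similarity over all pairs of model outputs for one page."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two model outputs")
    return mean(similarity(a, b) for a, b in pairs)


def triage(outputs: list[str], easy_threshold: float = 0.9):
    """Return (label, consensus_annotation, agreement_score).

    Hypothetical rule: agreement at or above the threshold means 'easy', and
    the output closest to all others is kept as the annotation. Otherwise the
    sample is 'hard' and no annotation is trusted automatically.
    """
    score = pairwise_agreement(outputs)
    if score < easy_threshold:
        return "hard", None, score

    def centrality(i: int) -> float:
        # Mean similarity of output i to every other output.
        return mean(
            similarity(outputs[i], other)
            for j, other in enumerate(outputs)
            if j != i
        )

    best = max(range(len(outputs)), key=centrality)
    return "easy", outputs[best], score
```

Under this sketch, samples labeled hard would be the natural candidates for a correction stage such as the Judge-and-Refine pipeline the abstract describes. A production system would presumably replace the character-level ratio with element-aware metrics for text, tables, and formulas.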

