

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

April 6, 2026
Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He
cs.AI

Abstract

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; and a Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard-sample fine-tuning, and GRPO alignment) sequentially exploits these data tiers of differing quality. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200× more parameters.
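The Cross-Model Consistency Verification idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the agreement metric or threshold, so token-level Jaccard similarity and the `threshold=0.8` cutoff below are stand-in assumptions. The sketch captures only the stated logic: high agreement among heterogeneous models marks a sample as easy and yields a consensus pseudo-label; low agreement marks it as hard and routes it onward (e.g., to a Judge-and-Refine stage).

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (assumed metric, for illustration only)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_verify(outputs: list[str], threshold: float = 0.8):
    """Given parses of one sample from several heterogeneous models, return
    (is_hard, pseudo_label). If mean pairwise agreement is high, the sample is
    easy and the medoid output (most similar to the rest) serves as a
    pseudo-label; otherwise the sample is hard and gets no label here."""
    if len(outputs) < 2:
        raise ValueError("need at least two model outputs")
    pairs = list(combinations(range(len(outputs)), 2))
    sims: dict[int, list[float]] = {i: [] for i in range(len(outputs))}
    for i, j in pairs:
        s = jaccard(outputs[i], outputs[j])
        sims[i].append(s)
        sims[j].append(s)
    # Each pairwise score is stored twice, so divide by 2 * number of pairs.
    mean_agreement = sum(s for v in sims.values() for s in v) / (2 * len(pairs))
    if mean_agreement < threshold:
        return True, None  # hard sample: no reliable consensus label
    medoid = max(sims, key=lambda i: sum(sims[i]) / len(sims[i]))
    return False, outputs[medoid]
```

Under this scheme the threshold directly controls the easy/hard split, which is what lets the pipeline spend cheap consensus labels on easy samples and reserve costly refinement for the disagreement cases.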