
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

April 6, 2026
Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He
cs.AI

Abstract

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training-strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; and a Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard-sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200 times more parameters.
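The Cross-Model Consistency Verification idea can be illustrated with a minimal sketch. The paper does not specify its agreement metric or thresholds; here pairwise string similarity (`difflib.SequenceMatcher`), the `triage` function, and the `easy_thresh`/`hard_thresh` cutoffs are all illustrative assumptions: high agreement among heterogeneous models yields a pseudo-label, low agreement flags a hard sample.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_agreement(outputs):
    """Mean pairwise text similarity (0..1) among model outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def triage(sample_id, outputs, easy_thresh=0.95, hard_thresh=0.70):
    """Route a page by cross-model agreement (thresholds are hypothetical).

    High agreement  -> adopt a consensus output as a reliable pseudo-label.
    Low agreement   -> flag as a hard sample for downstream refinement.
    In between      -> leave unlabeled.
    """
    score = pairwise_agreement(outputs)
    if score >= easy_thresh:
        # Picking the longest output as "consensus" is a simplification.
        return {"id": sample_id, "difficulty": "easy",
                "label": max(outputs, key=len), "agreement": score}
    if score < hard_thresh:
        return {"id": sample_id, "difficulty": "hard",
                "label": None, "agreement": score}
    return {"id": sample_id, "difficulty": "medium",
            "label": None, "agreement": score}
```

In practice a parsing-aware metric (e.g. edit distance over normalized Markdown, or table-structure similarity) would replace raw string similarity.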
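Diversity-and-Difficulty-Aware Sampling can likewise be sketched as weighted sampling that upweights under-represented document types (correcting distribution shift) and harder samples. The weighting formula, the `doc_type`/`difficulty` fields, and the `alpha` knob are assumptions for illustration, not the paper's actual procedure.

```python
import random
from collections import Counter

def diversity_difficulty_sample(pool, target_dist, k, alpha=1.0, seed=0):
    """Draw k samples from pool, reweighting each candidate by
    (a) how under-represented its doc_type is versus a target distribution
    and (b) its estimated difficulty (e.g. 1 - cross-model agreement)."""
    counts = Counter(s["doc_type"] for s in pool)
    total = len(pool)

    def weight(s):
        current_share = counts[s["doc_type"]] / total
        # >1 when the type is rarer in the pool than in the target mix.
        coverage = target_dist.get(s["doc_type"], 0.0) / max(current_share, 1e-9)
        return coverage * (1.0 + alpha * s["difficulty"])

    rng = random.Random(seed)
    weights = [weight(s) for s in pool]
    return rng.choices(pool, weights=weights, k=k)
```

With a pool that is 90% plain text and 10% tables but a 50/50 target mix, table pages receive roughly 9x the per-sample weight, so the drawn batch approaches the target distribution.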
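The Judge-and-Refine pipeline's render-then-verify loop reduces to a simple control structure. The `render`, `verify`, and `refine` callables and the `max_rounds` cap below are placeholders, assuming a renderer that turns an annotation back into a page image and a verifier that scores it against the original scan.

```python
def judge_and_refine(annotation, render, verify, refine, max_rounds=3):
    """Iteratively improve a hard sample's annotation.

    Each round: render the annotation back to a page, ask the verifier
    whether it matches the source, and if not, refine the annotation
    using the verifier's feedback. Returns (annotation, accepted).
    """
    for _ in range(max_rounds):
        rendered = render(annotation)
        ok, feedback = verify(rendered)
        if ok:
            return annotation, True
        annotation = refine(annotation, feedback)
    # Give back the best-effort annotation even if never accepted.
    return annotation, False
```

The key design point is that verification operates on the rendered output rather than the raw annotation string, so markup errors that change the visual result are caught even when the text looks plausible.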