MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
September 26, 2025
Authors: Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He
cs.AI
Abstract
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language
model that achieves state-of-the-art recognition accuracy while maintaining
exceptional computational efficiency. Our approach employs a coarse-to-fine,
two-stage parsing strategy that decouples global layout analysis from local
content recognition. In the first stage, the model performs efficient layout
analysis on downsampled images to identify structural elements, circumventing
the computational overhead of processing high-resolution inputs. In the second
stage, guided by the global layout, it performs targeted content recognition on
native-resolution crops extracted from the original image, preserving
fine-grained details in dense text, complex formulas, and tables. To support
this strategy, we developed a comprehensive data engine that generates diverse,
large-scale training corpora for both pretraining and fine-tuning. Ultimately,
MinerU2.5 demonstrates strong document parsing ability, achieving
state-of-the-art performance on multiple benchmarks, surpassing both
general-purpose and domain-specific models across various recognition tasks,
while maintaining significantly lower computational overhead.
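
The coarse-to-fine flow described above can be illustrated with a minimal sketch. It is not the MinerU2.5 implementation: the names `Region`, `downsample`, `crop_native`, and `parse_page` are hypothetical, the page is modeled as a plain grid of pixel values, and the two callables `detect_layout` and `recognize` stand in for the model's layout and recognition stages. The sketch also assumes layout boxes are expressed in normalized coordinates so they can be mapped from the downsampled image back onto the original.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative stand-ins: a page is a grid of pixel values, and a box is
# (x0, y0, x1, y1) normalized to [0, 1] so it is resolution-independent.
Image = List[List[int]]
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    kind: str  # e.g. "text", "formula", "table" (assumed labels)
    box: Box


def downsample(image: Image, factor: int) -> Image:
    """Nearest-neighbour downsampling: keep every `factor`-th row and column."""
    return [row[::factor] for row in image[::factor]]


def crop_native(image: Image, box: Box) -> Image:
    """Cut the region out of the ORIGINAL image, so no detail is lost."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    left, top = math.floor(x0 * width), math.floor(y0 * height)
    right, bottom = math.ceil(x1 * width), math.ceil(y1 * height)
    return [row[left:right] for row in image[top:bottom]]


def parse_page(
    image: Image,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[str, Image], object],
    factor: int = 4,
) -> List[Tuple[str, object]]:
    # Stage 1: layout analysis sees only the cheap, downsampled page.
    thumbnail = downsample(image, factor)
    regions = detect_layout(thumbnail)
    # Stage 2: content recognition sees native-resolution crops only.
    return [(r.kind, recognize(r.kind, crop_native(image, r.box))) for r in regions]
```

Under these assumptions, the expensive recognition step never processes the full high-resolution page at once; it only receives the crops selected by the layout stage, which is the decoupling the abstract describes.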