MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Abstract

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. MinerU2.5 demonstrates strong document parsing ability: it achieves state-of-the-art performance on multiple benchmarks and surpasses both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
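
The two-stage control flow described above can be illustrated with a minimal sketch. It is not the MinerU2.5 implementation: the image is a plain nested list, downsampling uses an assumed integer stride, and `detect_layout` and `recognize` are hypothetical callables standing in for the model's layout and recognition passes. The sketch only shows how boxes found on the downsampled page are mapped back to the original image so that crops keep native resolution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # illustrative stand-in: rows of grayscale pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass(frozen=True)
class Region:
    kind: str  # e.g. "text", "table", "formula" (labels are assumptions)
    box: Box   # coordinates in the downsampled image


def downsample(image: Image, factor: int) -> Image:
    """Nearest-neighbour downsampling: keep every `factor`-th pixel."""
    return [row[::factor] for row in image[::factor]]


def to_native(box: Box, factor: int, width: int, height: int) -> Box:
    """Scale a box from downsampled coordinates back to the original image."""
    left, top, right, bottom = box
    return (
        min(left * factor, width),
        min(top * factor, height),
        min(right * factor, width),
        min(bottom * factor, height),
    )


def crop(image: Image, box: Box) -> Image:
    """Cut a region out of the original image at full resolution."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def parse_page(
    image: Image,
    factor: int,
    detect_layout: Callable[[Image], List[Region]],  # hypothetical stage-1 model
    recognize: Callable[[str, Image], str],          # hypothetical stage-2 model
) -> List[Tuple[str, str]]:
    height, width = len(image), len(image[0])
    # Stage 1: layout analysis sees only the cheap, downsampled page.
    regions = detect_layout(downsample(image, factor))
    results = []
    for region in regions:
        # Stage 2: recognition sees a native-resolution crop of each region.
        native_box = to_native(region.box, factor, width, height)
        results.append((region.kind, recognize(region.kind, crop(image, native_box))))
    return results


# Stub models, used only to exercise the control flow.
def fake_layout(thumbnail: Image) -> List[Region]:
    return [Region("text", (0, 0, 2, 1)), Region("table", (1, 1, 3, 2))]


def fake_recognize(kind: str, patch: Image) -> str:
    return f"{kind}:{len(patch[0])}x{len(patch)}"


page = [[x + 10 * y for x in range(8)] for y in range(6)]  # 8 wide, 6 tall
parsed = parse_page(page, 2, fake_layout, fake_recognize)
```

With the stubs above, the layout pass receives a 4x3 thumbnail, while each recognized crop is 4x2 pixels taken from the original 8x6 page. The cost of the first stage therefore scales with the thumbnail size, and the second stage only processes the regions that the layout pass selected.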
