MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Abstract

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200× more parameters.
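
As a rough illustration of the Cross-Model Consistency Verification idea (agreement among heterogeneous models as a signal of sample difficulty and annotation reliability), the sketch below scores one sample by the mean pairwise similarity of several models' text outputs. The similarity metric (`difflib` ratio), the 0.9 threshold, and all function names are assumptions made for illustration; the abstract does not specify how agreement is measured or thresholded.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def agreement_score(outputs):
    """Mean pairwise text similarity (0..1) across model outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need outputs from at least two models")
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def classify_sample(outputs, easy_threshold=0.9):
    """Label a sample 'easy' or 'hard' from cross-model agreement.

    easy_threshold is a hypothetical cut-off, not a value from the paper.
    """
    score = agreement_score(outputs)
    label = "easy" if score >= easy_threshold else "hard"
    return label, score


def consensus_output(outputs):
    """Return the output most similar on average to the others (a medoid).

    Assumed stand-in for 'generating a reliable annotation' when models agree.
    """
    if len(outputs) < 2:
        raise ValueError("need outputs from at least two models")

    def avg_similarity(i):
        return mean(
            SequenceMatcher(None, outputs[i], other).ratio()
            for j, other in enumerate(outputs)
            if j != i
        )

    return outputs[max(range(len(outputs)), key=avg_similarity)]
```

Under these assumptions, a sample on which the models agree closely would be labeled easy and could take the consensus output as its annotation, while a low-agreement sample would be flagged as hard and passed to a more careful correction step such as the Judge-and-Refine pipeline.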
