MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Abstract

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, avoiding the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. As a result, MinerU2.5 demonstrates strong document parsing ability: it achieves state-of-the-art performance on multiple benchmarks and surpasses both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
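
The two-stage, coarse-to-fine flow described in the abstract can be illustrated with a minimal Python sketch. This is not the MinerU2.5 implementation: the names `Box`, `downsample`, `to_native`, `crop`, and `parse_page` are hypothetical, images are plain nested lists, and `fake_detect_layout` and `fake_recognize` are stand-in stubs for the model's layout and recognition passes. The sketch only shows the control flow: layout is detected on a downsampled image, the boxes are mapped back to native coordinates, and recognition runs on native-resolution crops.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative image type: a grayscale page as rows of pixel values.
Image = List[List[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned region with a layout label (hypothetical structure)."""
    x0: int
    y0: int
    x1: int
    y1: int
    label: str


def downsample(image: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def to_native(box: Box, factor: int, width: int, height: int) -> Box:
    """Map a box from thumbnail coordinates to native coordinates, clamped to the page."""
    return Box(
        min(box.x0 * factor, width),
        min(box.y0 * factor, height),
        min(box.x1 * factor, width),
        min(box.y1 * factor, height),
        box.label,
    )


def crop(image: Image, box: Box) -> Image:
    """Cut the region out of the original, full-resolution image."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def parse_page(
    image: Image,
    factor: int,
    detect_layout: Callable[[Image], List[Box]],
    recognize: Callable[[Image, str], str],
) -> List[Tuple[Box, str]]:
    height, width = len(image), len(image[0])
    # Stage 1: layout analysis on the cheap, downsampled page.
    thumbnail = downsample(image, factor)
    results = []
    for small_box in detect_layout(thumbnail):
        # Stage 2: content recognition on a native-resolution crop.
        box = to_native(small_box, factor, width, height)
        results.append((box, recognize(crop(image, box), box.label)))
    return results


def fake_detect_layout(thumbnail: Image) -> List[Box]:
    """Stub: pretends the top half is text and the bottom half is a table."""
    h, w = len(thumbnail), len(thumbnail[0])
    return [Box(0, 0, w, h // 2, "text"), Box(0, h // 2, w, h, "table")]


def fake_recognize(region: Image, label: str) -> str:
    """Stub: reports the label and the crop size instead of real content."""
    return f"{label}:{len(region[0])}x{len(region)}"
```

In this sketch the cost of the first stage depends only on the thumbnail size, while the second stage touches full-resolution pixels only inside the detected regions, which is the decoupling the abstract describes.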
