POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling with existing models often lacks accuracy in these challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train the public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
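
The second-stage loop (annotate real documents, filter the annotations, retrain, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `self_improve`, `keep`, and `retrain` are hypothetical, the model is assumed to be a plain callable from a document to its annotation, and the sketch assumes each round retrains from the current model, a detail the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Document = str
Annotation = str
Model = Callable[[Document], Annotation]
Pair = Tuple[Document, Annotation]


def self_improve(
    model: Model,
    documents: Sequence[Document],
    keep: Callable[[Document, Annotation], bool],
    retrain: Callable[[Model, List[Pair]], Model],
    rounds: int,
) -> Tuple[Model, List[int]]:
    """Hypothetical sketch of an annotate -> filter -> retrain loop.

    `keep` stands in for the filtering strategies and `retrain` for the
    training step; both are placeholders supplied by the caller.
    Returns the final model and the number of verified pairs per round.
    """
    verified_counts: List[int] = []
    for _ in range(rounds):
        # 1. Annotate every real document with the current model.
        annotated = [(doc, model(doc)) for doc in documents]
        # 2. Keep only the annotations that pass the quality filter.
        verified = [(doc, ann) for doc, ann in annotated if keep(doc, ann)]
        verified_counts.append(len(verified))
        # 3. Retrain on the verified pairs (assumption: start from the
        #    current model rather than from the original checkpoint).
        model = retrain(model, verified)
    return model, verified_counts
```

In this sketch the per-round count of verified pairs gives a simple way to watch whether the filter admits more data as the model improves, which is the behavior the abstract describes.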
