DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abstract

Multilingual document understanding remains limited for low-resource languages because training data is scarce and model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework for constructing high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Its two pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left (RTL) scripts, produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps on low-resource scripts. We show that Direct Preference Optimization (DPO), using rendering-derived ground truth as the positive signal, achieves stable multilingual adaptation: it improves both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves on the strongest baseline by 1.7%.
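
For readers unfamiliar with DPO, the standard per-pair objective can be sketched as follows. This is a minimal illustration and not the DocAtlas training code. It assumes the "chosen" response is the rendering-derived ground truth and the "rejected" response is some lower-quality output. The abstract does not say how rejected samples are obtained, and the function names and the beta value are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not report it
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the chosen response than the reference does.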