DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abstract

Multilingual document understanding remains limited for low-resource languages because training data is scarce and model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our two pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps on low-resource scripts. We show that Direct Preference Optimization (DPO), using rendering-derived ground truth as the positive signal, achieves stable multilingual adaptation: it improves both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves on the strongest baseline by 1.7%.
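
To make the DPO objective concrete, the following is a minimal sketch of the standard per-pair DPO loss, not the DocAtlas training code. It assumes the "chosen" response is the rendering-derived ground truth and the "rejected" response is some lower-quality transcription (for example a model's own output); the abstract does not specify how negatives are obtained. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not report it
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    under either the trainable policy or the frozen reference model.
    Assumption: "chosen" is the rendering-derived ground truth.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because the loss depends only on log-probability differences relative to a frozen reference model, the policy is penalized for drifting away from the reference on both responses. This is one plausible reading of why preference optimization could preserve base-language behavior better than supervised fine-tuning, though the abstract itself does not give the mechanism.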