DocAtlas: Multilingual Document Understanding Across 80+ Languages
May 12, 2026
Authors: Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan
cs.AI
Abstract
Multilingual document understanding remains limited for low-resource languages because training data is scarce and model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our two pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps on low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves on the strongest baseline by 1.7%.
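To make the preference-optimization idea concrete, the following is a minimal sketch of the standard per-example DPO loss, in which the "chosen" response would be the rendering-derived ground-truth annotation. It is not the authors' training code. The `PreferencePair` type, its field names, the choice of a model-generated annotation as the "rejected" response, and the `beta` value are all assumptions made for illustration; the abstract does not specify how negatives are chosen or which hyperparameters are used.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one DPO training example."""
    page_id: str   # identifier of the document page (assumed field)
    chosen: str    # rendering-derived ground-truth annotation (positive signal)
    rejected: str  # assumed negative: an annotation produced by a model


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trainable policy and a frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2. The loss falls below that value once the policy assigns relatively more probability to the ground-truth annotation than the reference does, and rises above it in the opposite case.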