DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abstract

Multilingual document understanding remains limited for low-resource languages, both because training data is scarce and because model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines (differential rendering of native DOCX documents, and synthetic LaTeX-based generation for right-to-left scripts) produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent performance gaps on low-resource scripts. We show that Direct Preference Optimization (DPO), using rendering-derived ground truth as the positive signal, achieves stable multilingual adaptation: it improves both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves on the strongest baseline by 1.7%.
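
To make the DPO objective concrete, the following is a minimal sketch of the standard per-example DPO loss, computed from summed sequence log-probabilities. It is illustrative only and is not the DocAtlas training code. It assumes that the "chosen" sequence is the rendering-derived ground-truth annotation and that the "rejected" sequence is some less accurate transcription of the same page; how rejected samples are obtained is not stated here, and the function name `dpo_loss` and the default `beta=0.1` are hypothetical choices.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full output
    sequence under the trainable policy or the frozen reference model.
    Assumption: 'chosen' is the rendering-derived ground truth and
    'rejected' is a less accurate transcription of the same page.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy equals the reference model, both margins are zero and the loss is log 2; the loss falls below that value as the policy shifts probability mass toward the ground-truth sequence relative to the rejected one.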