DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abstract

Multilingual document understanding remains limited for low-resource languages because training data is scarce and model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps on low-resource scripts. We show that Direct Preference Optimization (DPO), using rendering-derived ground truth as the positive signal, achieves stable multilingual adaptation: it improves both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, outperforms the strongest baseline by 1.7%.
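To make the DPO step concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, not the authors' training code. It assumes that the "chosen" output is the rendering-derived ground truth and that the "rejected" output is some lower-quality transcription (for example, an erroneous model prediction). How the rejected sample is obtained is not stated in the abstract, so that pairing is an assumption, as are the function names and the value of `beta`.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities under the trainable policy
    and under a frozen reference model. `beta` is a hypothetical setting.
    """
    # How much more the policy favors each output than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# Assumed pairing: chosen = rendering-derived ground truth,
# rejected = a lower-quality transcription of the same page.
loss_untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

When the policy matches the reference, the loss is log 2. It falls as the policy assigns relatively more probability to the ground-truth output than to the rejected one, which is the behavior the preference objective rewards.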