Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

Abstract

Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Baseer is trained on a large-scale dataset that combines synthetic and real-world documents, using a decoder-only fine-tuning strategy that adapts a pre-trained MLLM while preserving its general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a word error rate (WER) of 0.25 and establishing a new state of the art in Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
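For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance between the predicted text and the reference (substitutions + deletions + insertions), divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not state how the reported score was tokenized or normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no normalization of diacritics or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word in a four-word Arabic reference.
print(word_error_rate("مرحبا بكم في العالم", "مرحبا بك في العالم"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0 under this definition.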