Baseer: A Vision-Language OCR Model for Arabic Document-to-Markdown Conversion

Abstract

Arabic document OCR remains challenging because of the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset that combines synthetic and real-world documents, Baseer is trained with a decoder-only fine-tuning strategy that adapts a pre-trained MLLM while preserving its general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a word error rate (WER) of 0.25 and establishing a new state of the art in Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages such as Arabic.
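For readers unfamiliar with the headline metric, the following is a minimal sketch of word error rate under its standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not specify how text is normalized (for example, how diacritics or Markdown markup are handled) before scoring, so this should not be read as the exact evaluation procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption: words are obtained by whitespace splitting, with no further
    normalization. The actual benchmark may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 0.25 corresponds to roughly one word-level error for every four reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.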