Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
September 17, 2025
Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
cs.AI
Abstract
Arabic document OCR remains a challenging task due to the language's cursive
script, diverse fonts, diacritics, and right-to-left orientation. While modern
Multimodal Large Language Models (MLLMs) have advanced document understanding
for high-resource languages, their performance on Arabic remains limited. In
this work, we introduce Baseer, a vision-language model fine-tuned
specifically for Arabic document OCR. Leveraging a large-scale dataset
combining synthetic and real-world documents, Baseer is trained using a
decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving
general visual features. We also present Misraj-DocOCR, a high-quality,
expert-verified benchmark designed for rigorous evaluation of Arabic OCR
systems. Our experiments show that Baseer significantly outperforms existing
open-source and commercial solutions, achieving a WER of 0.25 and establishing
a new state of the art in the domain of Arabic document OCR. Our results
highlight the benefits of domain-specific adaptation of general-purpose MLLMs
and establish a strong baseline for high-accuracy OCR on morphologically rich
languages like Arabic.
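
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the OCR output and the ground truth, divided by the number of ground-truth words. The sketch below is a minimal illustration of that standard definition, not the evaluation code used for Baseer; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the abstract does not say what text normalization (for example, of diacritics or hamza forms) was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: words are separated by whitespace and compared exactly,
    with no normalization of diacritics or letter variants.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One word out of four differs (the hamza is missing in the second word
# of the hypothesis pair below), so one substitution over four words.
reference = "ذهب الولد إلى المدرسة"
hypothesis = "ذهب الولد الى المدرسة"
print(word_error_rate(reference, hypothesis))
```

Under this definition, a WER of 0.25 corresponds to one word-level error for every four reference words on average. Because exact matching is strict, a single missing diacritic or hamza counts as a full word error, which is one reason normalization choices matter when comparing Arabic OCR systems.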