Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

September 17, 2025
Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
cs.AI

Abstract

Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a word error rate (WER) of 0.25 and establishing a new state of the art in Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
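
For readers unfamiliar with the metric, the reported WER can be read against the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between the OCR output and the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition only. The function name `word_error_rate`, the whitespace tokenization, and the absence of any diacritic or punctuation normalization are assumptions made here; the abstract does not state how the authors tokenize or normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly,
    with no normalization of diacritics or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four reference words.
reference = "ذهب الولد إلى المدرسة"
hypothesis = "ذهب الولد على المدرسة"
score = word_error_rate(reference, hypothesis)
```

Under this definition a score of 0.25 corresponds to one word-level error per four reference words on average. Because insertions count as errors, the value can exceed 1.0 when the output is much longer than the reference.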