Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
September 17, 2025
Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
cs.AI
Abstract
Arabic document OCR remains a challenging task due to the language's cursive
script, diverse fonts, diacritics, and right-to-left orientation. While modern
Multimodal Large Language Models (MLLMs) have advanced document understanding
for high-resource languages, their performance on Arabic remains limited. In
this work, we introduce Baseer, a vision-language model fine-tuned
specifically for Arabic document OCR. Leveraging a large-scale dataset
combining synthetic and real-world documents, Baseer is trained using a
decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving
general visual features. We also present Misraj-DocOCR, a high-quality,
expert-verified benchmark designed for rigorous evaluation of Arabic OCR
systems. Our experiments show that Baseer significantly outperforms existing
open-source and commercial solutions, achieving a word error rate (WER) of
0.25 and establishing a new state of the art in Arabic document OCR. Our results
highlight the benefits of domain-specific adaptation of general-purpose MLLMs
and establish a strong baseline for high-accuracy OCR on morphologically rich
languages like Arabic.
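
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcription and the OCR output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not state the exact normalization or tokenization used to obtain the reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; tokenization here is a plain
    whitespace split, which may differ from the paper's evaluation setup.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

Under this definition, a WER of 0.25 corresponds to roughly one word-level error (substitution, deletion, or insertion) for every four reference words.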