Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

Abstract

Optical character recognition (OCR) for Arabic documents remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset that combines synthetic and real-world documents, Baseer is trained with a decoder-only fine-tuning strategy that adapts a pre-trained MLLM while preserving its general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a word error rate (WER) of 0.25 and establishing a new state of the art in Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
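For readers unfamiliar with the headline metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the OCR output and the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for Baseer. It assumes plain whitespace tokenization and no text normalization; the abstract does not say how diacritics or Markdown markup are handled before scoring. The function names `edit_distance` and `wer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix processed so far
    # and the first j tokens of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing from hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    # Assumption: tokens are separated by whitespace and compared exactly.
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(wer("the quick brown fox", "the quick brown dog"))
```

Under this definition, a WER of 0.25 corresponds to an average of one word-level error for every four reference words. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.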