QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

June 2, 2025
Authors: Ahmed Wasfy, Omer Nacar, Abdelakreem Elkhateb, Mahmoud Reda, Omar Elshehy, Adel Ammar, Wadii Boulila
cs.AI

Abstract

The inherent complexities of Arabic script, including its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text recognition. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
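
For readers unfamiliar with the WER and CER figures quoted in the abstract, the following is a minimal sketch of how these metrics are conventionally defined: the edit distance between the reference and the OCR output, divided by the reference length in words or characters. The abstract does not describe the authors' evaluation code or any text normalization they apply, so the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, not the paper's implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on strings (characters) and on lists (words).
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: code-point edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

In ordinary Unicode text, Arabic diacritics are separate combining code points, so under this code-point definition an output that drops the tashkeel is penalized even when every base letter is correct. For example, comparing a fully vocalized three-letter word (six code points) against the same word without its three diacritics gives a CER of 0.5 and a WER of 1.0, which is one reason diacritically rich text is a demanding benchmark.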