
QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

June 2, 2025
Authors: Ahmed Wasfy, Omer Nacar, Abdelakreem Elkhateb, Mahmoud Reda, Omar Elshehy, Adel Ammar, Wadii Boulila
cs.AI

Abstract

The inherent complexities of Arabic script, including its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, a Character Error Rate (CER) of 0.061, and a BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
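
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how WER and CER are conventionally computed from Levenshtein edit distance. It is not the authors' evaluation code: the function names `edit_distance`, `wer`, and `cer` are hypothetical, whitespace tokenization is an assumption, and the abstract does not say which text normalization (if any) was applied before scoring. Characters are counted here as Unicode code points, so each Arabic diacritic is scored as its own character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: code-point edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Reference carries three fatha marks; the hypothesis drops all of them.
    reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
    hypothesis = "\u0643\u062A\u0628"
    print("WER:", wer(reference, hypothesis))
    print("CER:", cer(reference, hypothesis))
```

Because diacritics are separate code points, a transcription that gets every base letter right but omits the tashkeel still incurs character errors and turns the whole word into a word-level error, which is why diacritically rich text is a demanding test for both metrics.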
