QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Abstract

The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct and progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state of the art on diacritically rich texts, with a Word Error Rate (WER) of 0.160, a Character Error Rate (CER) of 0.061, and a BLEU score of 0.737. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside strong performance on low-resolution images. Further explorations (QARI v0.3) show strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, and all models and datasets are released to foster further research.
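For readers unfamiliar with the reported metrics, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. It is an illustration under assumptions, not the evaluation script used in this work. The function names are hypothetical, whitespace tokenization is assumed for WER, and characters are counted as Unicode code points, so each tashkeel mark counts as its own character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (assumed convention)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate with whitespace tokenization (assumed convention)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings: one three-letter word with a fatha (U+064E) after
# each letter, and the same word with all three diacritics dropped.
reference = "\u0643\u064e\u062a\u064e\u0628\u064e"
hypothesis = "\u0643\u062a\u0628"
```

With these strings, `cer(reference, hypothesis)` charges one deletion per missing diacritic, while `wer(reference, hypothesis)` counts the whole word as wrong. This is why diacritically rich text is a demanding test: a model that recognizes every base letter can still score poorly if it omits tashkeel.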