QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Abstract

The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct and progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, sets a new open-source state of the art on diacritically rich texts, with a Word Error Rate (WER) of 0.160, a Character Error Rate (CER) of 0.061, and a BLEU score of 0.737. Qari-OCR handles tashkeel, diverse fonts, and varied document layouts better than existing open-source models, and it performs well on low-resolution images. Further explorations (QARI v0.3) show strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, and all models and datasets are released to foster further research.
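For readers unfamiliar with the reported metrics, the following is a minimal sketch of how WER and CER are conventionally defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words or characters. The function names are illustrative, and the choice to count each diacritic as its own character (one Unicode code point) is an assumption; the abstract does not specify the evaluation code actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Assumed example: the word "kataba" with three fatha marks (U+064E),
# against an OCR output that dropped every diacritic.
reference = "\u0643\u064e\u062a\u064e\u0628\u064e"
hypothesis = "\u0643\u062a\u0628"
print(cer(reference, hypothesis))  # 3 deletions over 6 code points: 0.5
print(wer(reference, hypothesis))  # the single word differs: 1.0
```

Under this convention, a dropped tashkeel mark counts as a full character error and turns the whole word into a word error, which is one reason diacritically rich text is a demanding benchmark. In practice both strings would likely be Unicode-normalized (for example with `unicodedata.normalize`) before comparison.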