GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Abstract

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
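The failure mode described above, where a model emits characters from a script it already knows, can be made concrete with a rough diagnostic. The sketch below is illustrative only and is not taken from the GlotOCR Bench pipeline: the function names (`rough_script`, `script_histogram`, `off_script_ratio`) are hypothetical, and the script of a character is approximated by the first word of its Unicode character name, which is a heuristic and not the Unicode Script property.

```python
import unicodedata
from collections import Counter


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is a heuristic stand-in for the Unicode Script property,
    which the Python standard library does not expose directly.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # code points without a name, e.g. control characters
        return "UNKNOWN"


def script_histogram(text: str) -> Counter:
    """Count alphabetic characters per approximate script."""
    return Counter(rough_script(ch) for ch in text if ch.isalpha())


def off_script_ratio(prediction: str, expected_script: str) -> float:
    """Fraction of alphabetic characters that fall outside the expected script."""
    hist = script_histogram(prediction)
    total = sum(hist.values())
    if total == 0:
        return 0.0  # nothing alphabetic to judge
    return 1.0 - hist[expected_script] / total


if __name__ == "__main__":
    # Hypothetical OCR output for an image whose ground truth is Cyrillic.
    print(script_histogram("abпр"))
    print(off_script_ratio("abпр", "CYRILLIC"))
```

A high ratio on a sample whose ground truth is in a single script would suggest the model substituted characters from another script; a proper analysis would use the real Script property and handle shared characters such as digits and punctuation.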
