GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
April 14, 2026
Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
cs.AI
Abstract
Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
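Since results in the benchmark are broken down by Unicode script, a reader may want to bucket evaluation text by script. The sketch below is a rough, stdlib-only heuristic (not the benchmark's actual pipeline): it guesses a string's dominant script from the first word of each character's Unicode name (e.g. "ARABIC LETTER ALEF" → "ARABIC"). The proper mechanism is the Unicode `Script` property (UAX #24), which Python's standard library does not expose directly; the helper name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Approximate the dominant script of `text` from Unicode character
    names. A rough heuristic only: name prefixes usually, but not always,
    coincide with the Script property (e.g. CJK ideographs yield "CJK")."""
    counts: Counter[str] = Counter()
    for ch in text:
        if ch.isspace():
            continue
        try:
            name = unicodedata.name(ch)  # e.g. "LATIN SMALL LETTER H"
        except ValueError:
            continue  # unnamed code point (controls, unassigned)
        counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

For example, `dominant_script("سلام")` returns `"ARABIC"` and `dominant_script("hello")` returns `"LATIN"`; a production pipeline would instead query the Script property via a library such as PyICU or a `\p{Script=...}`-aware regex engine.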