GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

April 14, 2026
Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
cs.AI

Abstract

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both left-to-right (LTR) and right-to-left (RTL) scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
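The abstract does not name the evaluation metric, but character error rate (CER) is the standard measure for OCR, and the reported failure mode (hallucinating characters from a similar, better-known script) can be detected by inspecting which Unicode script dominates a model's output. Below is a minimal sketch of both checks using only the Python standard library; the helper names `cer` and `dominant_script` are illustrative assumptions, not functions from the paper's pipeline.

```python
import unicodedata
from collections import Counter


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance divided by
    the reference length (a sketch; the paper's exact metric may differ)."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return float(n > 0)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m


def dominant_script(text: str) -> str:
    """Rough script label for a string, taken from the first word of each
    character's Unicode name (e.g. 'LATIN', 'DEVANAGARI', 'THAI').
    Useful for flagging output written in the wrong script entirely."""
    counts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

For example, if the reference line is Devanagari but `dominant_script` of the model output returns `"LATIN"`, the model has substituted a familiar script rather than reading the image, which CER alone would report only as a uniformly high error rate.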
April 16, 2026