CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

September 24, 2025
Authors: Sina J. Semnani, Han Zhang, Xinyan He, Merve Tekgürler, Monica S. Lam
cs.AI

Abstract

Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
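
The headline metric above, normalized Levenshtein similarity, can be illustrated with a minimal Python sketch. The abstract does not spell out the normalization, so this sketch assumes the common definition: one minus the character-level edit distance divided by the length of the longer string. The function names are hypothetical and are not taken from the CHURRO codebase, and any text preprocessing the authors apply before scoring is not reproduced here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two strings are identical.

    Assumption: normalization by the longer string's length. The paper's
    exact normalization may differ.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein_distance(prediction, reference) / longest
```

Under this assumed definition, a transcription that needs 3 character edits against a 7-character reference scores 1 - 3/7, or about 57%. The reported 82.3% and 70.1% would be averages of such per-page scores, although the abstract does not state the aggregation method.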