CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
September 24, 2025
Authors: Sina J. Semnani, Han Zhang, Xinyan He, Merve Tekgürler, Monica S. Lam
cs.AI
Abstract
Accurate text recognition for historical documents can greatly advance the
study and preservation of cultural heritage. Existing vision-language models
(VLMs), however, are designed for modern, standardized texts and are not
equipped to read the diverse languages and scripts, irregular layouts, and
frequent degradation found in historical materials.
This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for
historical text recognition. The model is trained on CHURRO-DS, the largest
historical text recognition dataset to date. CHURRO-DS unifies 155 historical
corpora comprising 99,491 pages, spanning 22 centuries of textual heritage
across 46 language clusters, including historical variants and dead languages.
We evaluate several open-weight and closed VLMs and optical character
recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all
other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and
70.1% (handwritten) normalized Levenshtein similarity, surpassing the
second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being
15.5 times more cost-effective.
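
For readers unfamiliar with the metric, normalized Levenshtein similarity can be sketched as follows. This is a minimal illustration assuming one common definition: 1 minus the character-level edit distance divided by the length of the longer string. The function names are hypothetical, and the exact normalization and text preprocessing used for the CHURRO-DS evaluation are not specified in this abstract.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical.

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(prediction, reference) / longest


if __name__ == "__main__":
    # Hypothetical model output versus a ground-truth transcription.
    score = normalized_levenshtein_similarity("Anno Dom1ni", "Anno Domini")
    print(f"{score:.3f}")
```

Under this definition, a reported score of 82.3% would mean that, on average, roughly 18% of the characters (relative to the longer string) need editing to match the reference transcription.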
By releasing the model and dataset, we aim to enable community-driven
research to improve the readability of historical texts and accelerate
scholarship.