CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Abstract

Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
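
The abstract reports scores as normalized Levenshtein similarity but does not spell out the normalization. The sketch below shows one common way such a score can be computed, assuming the edit distance is divided by the length of the longer string and subtracted from 1. The function names `levenshtein` and `normalized_similarity` are illustrative, and the paper's exact definition (for example, any text normalization applied before comparison) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 1]; assumes normalization by the longer string's length."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    # Hypothetical model output compared against a ground-truth transcription.
    score = normalized_similarity("kitten", "sitting")
    print(f"{score:.4f}")
```

Under this assumed definition, a score of 1.0 means the transcription matches the reference exactly, and 0.0 means every character would have to change.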