MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Abstract

Chinese characters, foundational to the Chinese language and culture, span an extraordinarily large and still-growing set of categories; the latest Chinese standard, GB18030-2022, contains 87,887 of them. Accurately recognizing this vast number of characters, termed mega-category recognition, is a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored for lack of comprehensive datasets: the largest existing dataset contains only 16,151 categories. To bridge this gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work makes three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times as many categories as existing datasets; (2) it effectively addresses the long-tail distribution problem by providing balanced samples across all categories through three distinct subsets: handwritten, historical, and synthetic; (3) comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, recognition of morphologically similar characters, and zero-shot learning difficulties, while also opening substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes not only in OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.