MegaHan97K: A Large-Scale Chinese Character Recognition Dataset with Over 97,000 Categories

Abstract

Foundational to the Chinese language and culture, Chinese characters encompass an extraordinarily extensive and ever-expanding set of categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) it effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical, and synthetic; (3) comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.