MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Abstract

Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) it effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical, and synthetic; (3) comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, recognition of morphologically similar characters, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.
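As a rough illustration of the figures quoted in the abstract, the following Python sketch checks the "at least six times" ratio and estimates the size of a single fully connected classification layer over 97,455 classes. The category counts come from the abstract; the feature width of 512 and 32-bit floats are assumptions chosen purely for illustration, not values reported for the MegaHan97K benchmarks, and the helper names are hypothetical.

```python
# Category counts quoted in the abstract.
MEGAHAN_CATEGORIES = 97_455
LARGEST_PRIOR_CATEGORIES = 16_151

# Assumptions for illustration only (not taken from the paper).
FEATURE_DIM = 512      # hypothetical width of the feature vector fed to the classifier
BYTES_PER_PARAM = 4    # float32 storage


def category_ratio(new: int, old: int) -> float:
    """Return how many times more categories `new` has than `old`."""
    return new / old


def linear_head_params(num_classes: int, feature_dim: int) -> int:
    """Count weights plus biases of one fully connected classification layer."""
    return num_classes * feature_dim + num_classes


if __name__ == "__main__":
    ratio = category_ratio(MEGAHAN_CATEGORIES, LARGEST_PRIOR_CATEGORIES)
    params = linear_head_params(MEGAHAN_CATEGORIES, FEATURE_DIM)
    megabytes = params * BYTES_PER_PARAM / 1e6
    print(f"category ratio: {ratio:.2f}x")
    print(f"classifier head: {params:,} parameters, about {megabytes:.0f} MB")
```

Under these assumptions, the output layer alone holds roughly 50 million parameters, which gives a sense of why storage is listed among the challenges of mega-category recognition. A different feature width or numeric precision would scale the estimate proportionally.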
