CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

Abstract

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their unique writing systems differ from international standards. This discrepancy has led to a severe shortage of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which contains 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, curated specifically for headline generation. We also propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and will contribute to the development of related benchmarks.