CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
September 12, 2025
Authors: Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
cs.AI
Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional
Mongolian, face significant challenges due to their unique writing systems,
which differ from international standards. This discrepancy has led to a severe
lack of relevant corpora, particularly for supervised tasks like headline
generation. To address this gap, we introduce a novel dataset, Chinese Minority
Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and
50,000 entries each for Uyghur and Mongolian, specifically curated for headline
generation tasks. Additionally, we propose a high-quality test set annotated by
native speakers, designed to serve as a benchmark for future research in this
domain. We hope this dataset will become a valuable resource for advancing
headline generation in Chinese minority languages and contribute to the
development of related benchmarks.