CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

Abstract

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their unique writing systems differ from international standards. This discrepancy has led to a severe shortage of corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce Chinese Minority Headline Generation (CMHG), a new dataset curated specifically for headline generation that contains 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian. We also provide a high-quality test set annotated by native speakers, intended to serve as a benchmark for future research in this area. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and will contribute to the development of related benchmarks.