CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

September 12, 2025
Authors: Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
cs.AI

Abstract

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we provide a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in China's minority languages and contribute to the development of related benchmarks.