A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

January 19, 2026
Authors: Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli, Mahmoud ElHussieni
cs.AI

Abstract

We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms), a 10x increase in scale over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model that achieves 90% top-1 retrieval accuracy and a classification model that attains a 90% macro-F1 score. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
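
To make phase (1) concrete, the following is a minimal sketch of agglomerative clustering over word vectors, written in plain Python so it runs without dependencies. It is not the authors' implementation: the abstract does not state the linkage criterion, distance metric, or cut-off, so average linkage, cosine distance, and the `threshold` value are assumptions, and the three-dimensional `TOY_VECTORS` are hand-made stand-ins for real FastText embeddings. The helper names (`agglomerative_clusters`, `candidate_pairs`) are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering over a {word: vector} dict.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold` (an assumed cut-off, not a value taken from the paper).
    """
    clusters = [[word] for word in vectors]

    def linkage(c1, c2):
        # Average pairwise cosine distance between the two clusters.
        dists = [cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


def candidate_pairs(clusters):
    """List every unordered word pair that shares a cluster."""
    pairs = []
    for cluster in clusters:
        pairs.extend(combinations(cluster, 2))
    return pairs


# Toy vectors (assumed values, for illustration only).
TOY_VECTORS = {
    "siyah": [0.90, 0.10, 0.00],  # "black"
    "kara": [0.85, 0.15, 0.05],   # "black" (near-synonym)
    "kedi": [0.00, 0.90, 0.20],   # "cat"
    "kuzu": [0.10, 0.85, 0.25],   # "lamb"
}

clusters = agglomerative_clusters(TOY_VECTORS, threshold=0.2)
pairs = candidate_pairs(clusters)
print(clusters)  # [['kara', 'siyah'], ['kedi', 'kuzu']]
print(pairs)     # [('kara', 'siyah'), ('kedi', 'kuzu')]
```

In a pipeline of the kind the abstract describes, pairs such as these would be the candidates passed to the classification phase, which decides whether each pair is a synonym, antonym, or co-hyponym relation. Clustering only narrows the search space; it does not label the relation itself.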