A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

January 19, 2026
Authors: Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli, Mahmoud ElHussieni
cs.AI

Abstract

We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with agglomerative clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms), a 10x scale increase over existing resources, produced at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining a 90% macro-F1 score. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
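The first phase described above (embedding words, then clustering them to find candidate pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors stand in for real FastText embeddings, and the cosine distance, average linkage, and the 0.05 threshold are all assumptions chosen only to keep the example small and dependency-free. In the pipeline the abstract describes, the pairs produced within each cluster would then be passed to an LLM for relation labelling, which is not shown here.

```python
import math
from itertools import combinations

# Hypothetical toy vectors standing in for FastText embeddings.
TOY_VECTORS = {
    "büyük": [0.9, 0.1, 0.0],    # "big"
    "iri": [0.85, 0.15, 0.05],   # "large"
    "küçük": [0.8, 0.2, 0.1],    # "small"
    "kedi": [0.1, 0.9, 0.2],     # "cat"
    "köpek": [0.15, 0.85, 0.25], # "dog"
    "masa": [0.0, 0.2, 0.95],    # "table"
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(c1, c2, vectors):
    """Mean pairwise cosine distance between the members of two clusters."""
    dists = [cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)


def agglomerative_clusters(vectors, threshold):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    until the closest remaining pair is farther apart than `threshold`."""
    clusters = [[word] for word in sorted(vectors)]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], vectors),
        )
        if average_linkage(clusters[i], clusters[j], vectors) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def candidate_pairs(clusters):
    """All unordered word pairs that share a cluster."""
    pairs = []
    for cluster in clusters:
        pairs.extend(combinations(sorted(cluster), 2))
    return pairs


clusters = agglomerative_clusters(TOY_VECTORS, threshold=0.05)
pairs = candidate_pairs(clusters)
for left, right in pairs:
    print(left, right)
```

Note that in this toy data "büyük" and "küçük" (antonyms) fall into the same cluster as "iri" (a near-synonym of "büyük"). Embedding proximity alone does not separate synonyms from antonyms or co-hyponyms, which is consistent with the abstract's use of a separate classification phase after clustering.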