Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

February 21, 2026
Authors: Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov
cs.AI

Abstract

Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages, where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds, using a culturally informed annotation protocol designed specifically for Yorùbá sarcasm. The protocol incorporates context-sensitive interpretation and community-informed guidelines, and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Agreement ranged from substantial to almost perfect (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732 to 0.8743), with unanimous consensus on 83.3% of instances. One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding several benchmarks reported in English sarcasm research. The remaining 16.7% of instances, decided by majority agreement, are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
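
To make the agreement statistics above concrete, the following is a minimal sketch of how Fleiss' κ, pairwise Cohen's κ, and majority-vote soft labels can be computed for three annotators and binary labels. It is an illustration under stated assumptions, not the authors' pipeline: the labels in `toy_rows` are invented and are not taken from Yor-Sarc, and the function names (`cohen_kappa`, `fleiss_kappa`, `soft_label`) are hypothetical helpers written for this example.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    # Observed agreement: share of items where both annotators match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1


def fleiss_kappa(rows, categories=(0, 1)):
    """Fleiss' kappa; each row holds the labels all raters gave one item."""
    n_items, n_raters = len(rows), len(rows[0])
    counts = [Counter(r) for r in rows]
    # Mean per-item agreement across all rater pairs.
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    # Chance agreement from the pooled category proportions.
    p_j = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)  # undefined when p_e == 1


def soft_label(row, positive=1):
    """Fraction of annotators who chose the positive (sarcastic) label."""
    return sum(1 for x in row if x == positive) / len(row)


# Invented toy data: 6 items, 3 annotators, 1 = sarcastic, 0 = not sarcastic.
toy_rows = [
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
    (0, 0, 0),
    (1, 1, 1),
    (0, 1, 0),
]

fleiss = fleiss_kappa(toy_rows)
annotators = list(zip(*toy_rows))  # one label sequence per annotator
pairwise = {
    (i, j): cohen_kappa(annotators[i], annotators[j])
    for i, j in combinations(range(len(annotators)), 2)
}
soft = [soft_label(r) for r in toy_rows]
unanimous_share = sum(len(set(r)) == 1 for r in toy_rows) / len(toy_rows)

print(f"Fleiss kappa: {fleiss:.4f}")
print({pair: round(k, 4) for pair, k in pairwise.items()})
print("soft labels:", [round(s, 3) for s in soft])
print(f"unanimous share: {unanimous_share:.3f}")
```

In this sketch, unanimous items receive a soft label of 0.0 or 1.0, while 2-to-1 splits receive 1/3 or 2/3, which is one simple way to retain annotator disagreement for uncertainty-aware training instead of collapsing it into a hard majority label.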