Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

February 21, 2026
Authors: Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov
cs.AI

Abstract

Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages, where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds, using an annotation protocol designed specifically for Yorùbá sarcasm that takes culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines, and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732 to 0.8743), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding several benchmarks reported in English sarcasm research. The remaining 16.7% of instances, which reached majority agreement only, are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
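
For readers unfamiliar with the agreement statistics quoted in the abstract, the following is a minimal sketch of how Fleiss' κ, pairwise Cohen's κ, and a soft label can be computed for three annotators and a binary sarcastic/non-sarcastic label. It is an illustration only: the function names are hypothetical, the six toy rows are invented and are not drawn from Yor-Sarc, and encoding a soft label as the fraction of "sarcastic" votes is an assumption, not necessarily the authors' scheme.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    # Observed agreement: share of items where both annotators chose the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(rows, categories=(0, 1)):
    """Fleiss' kappa; each row holds every annotator's label for one item."""
    n_items = len(rows)
    n_raters = len(rows[0])
    # counts[i][j] = number of annotators who gave item i the category j.
    counts = [[list(row).count(c) for c in categories] for row in rows]
    # Per-item agreement, then its mean over all items.
    p_items = [
        (sum(c * c for c in cnt) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the pooled category proportions.
    p_cat = [
        sum(cnt[j] for cnt in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


def soft_label(row, positive=1):
    """Assumed soft-label encoding: fraction of annotators voting `positive`."""
    return list(row).count(positive) / len(row)


# Toy annotations (invented for illustration): 1 = sarcastic, 0 = not sarcastic.
toy_rows = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0)]

print("Fleiss' kappa:", round(fleiss_kappa(toy_rows), 4))
for i, j in combinations(range(3), 2):
    a = [row[i] for row in toy_rows]
    b = [row[j] for row in toy_rows]
    print(f"Cohen's kappa, annotators {i} and {j}:", round(cohen_kappa(a, b), 4))

unanimous = sum(len(set(row)) == 1 for row in toy_rows) / len(toy_rows)
print("Unanimous share:", round(unanimous, 3))
print("Soft labels:", [round(soft_label(row), 3) for row in toy_rows])
```

Under this assumed encoding, unanimous items receive a soft label of 0 or 1, while 2-to-1 splits fall at 1/3 or 2/3, so a model trained on them sees the annotators' disagreement instead of a forced hard label.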