ModelCitizens: Representing Community Voices in Online Safety
July 7, 2025
Authors: Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
cs.AI
Abstract
Automatic toxic language detection is critical for creating safe, inclusive
online spaces. However, it is a highly subjective task, with perceptions of
toxic language shaped by community norms and lived experience. Existing
toxicity detection models are typically trained on annotations that collapse
diverse annotator perspectives into a single ground truth, erasing important
context-specific notions of toxicity such as reclaimed language. To address
this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K
toxicity annotations across diverse identity groups. To capture the effect of
conversational context, typical of social media posts, on toxicity, we augment
MODELCITIZENS posts with LLM-generated conversational scenarios.
State-of-the-art toxicity detection tools (e.g., the OpenAI Moderation API,
GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on
context-augmented posts. Finally, we release LLAMACITIZEN-8B and
GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS,
which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our
findings highlight the importance of community-informed annotation and modeling
for inclusive content moderation. The data, models and code are available at
https://github.com/asuvarna31/modelcitizens.
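
The contrast the abstract draws, between collapsing all annotators into a single ground truth and keeping community-specific judgments, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy annotations, the `ingroup`/`outgroup` group names, the majority-vote rule, and the tie-breaking choice are hypothetical and are not taken from the MODELCITIZENS data or code.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations (NOT from MODELCITIZENS):
# (post_id, annotator_group, label), where 1 = toxic and 0 = not toxic.
# "ingroup" stands for annotators who share the identity the post refers to.
annotations = [
    ("post_a", "ingroup", 0),
    ("post_a", "ingroup", 0),
    ("post_a", "outgroup", 1),
    ("post_a", "outgroup", 1),
    ("post_a", "outgroup", 1),
]


def majority(labels):
    """Majority vote over binary labels; ties resolve to toxic (an assumed rule)."""
    counts = Counter(labels)
    return 1 if counts[1] >= counts[0] else 0


def collapse_all(rows):
    """One label per post, pooling every annotator regardless of group."""
    by_post = defaultdict(list)
    for post, _group, label in rows:
        by_post[post].append(label)
    return {post: majority(labels) for post, labels in by_post.items()}


def collapse_by_group(rows):
    """One label per (post, group), so group-level disagreement stays visible."""
    by_key = defaultdict(list)
    for post, group, label in rows:
        by_key[(post, group)].append(label)
    return {key: majority(labels) for key, labels in by_key.items()}


single = collapse_all(annotations)
per_group = collapse_by_group(annotations)
print(single)
print(per_group)
```

In this toy case the pooled vote marks the post as toxic, while the group-aware view shows that the hypothetical in-group annotators did not, which is the kind of signal (for example, reclaimed language) that a single ground-truth label would hide.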