ModelCitizens: Representing Community Voices in Online Safety
July 7, 2025
Authors: Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
cs.AI
Abstract
Automatic toxic language detection is critical for creating safe, inclusive
online spaces. However, it is a highly subjective task, with perceptions of
toxic language shaped by community norms and lived experience. Existing
toxicity detection models are typically trained on annotations that collapse
diverse annotator perspectives into a single ground truth, erasing important
context-specific notions of toxicity such as reclaimed language. To address
this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K
toxicity annotations across diverse identity groups. To capture the effect of
conversational context on toxicity, typical of social media posts, we augment
MODELCITIZENS posts with LLM-generated conversational scenarios.
State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API,
GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on
context-augmented posts. Finally, we release LLAMACITIZEN-8B and
GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS,
which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our
findings highlight the importance of community-informed annotation and modeling
for inclusive content moderation. The data, models and code are available at
https://github.com/asuvarna31/modelcitizens.