ModelCitizens: Representing Community Voices in Online Safety

Abstract

Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task: perceptions of toxic language are shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. Because social media posts typically appear within a conversation, we augment MODELCITIZENS posts with LLM-generated conversational scenarios to capture the effect of that context on toxicity. State-of-the-art toxicity detection tools (e.g., OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models fine-tuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models, and code are available at https://github.com/asuvarna31/modelcitizens.
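
To make the aggregation concern concrete, the following is a minimal sketch of how pooling all annotators into one majority label can hide a disagreement between annotator groups. It is not the MODELCITIZENS pipeline: the annotation values, the group names "ingroup" and "outgroup", the tie-breaking rule, and every function name are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotations for a single post: (annotator_group, label),
# where 1 = toxic and 0 = not toxic. The values are invented for illustration.
annotations = [
    ("ingroup", 0), ("ingroup", 0), ("ingroup", 0),
    ("outgroup", 1), ("outgroup", 1), ("outgroup", 1), ("outgroup", 1),
]


def majority(labels):
    """Return the majority label; ties resolve to toxic (1), an assumed rule."""
    counts = Counter(labels)
    return 1 if counts[1] >= counts[0] else 0


def pooled_label(annotations):
    """Collapse every annotator into one label, ignoring group membership."""
    return majority([label for _, label in annotations])


def per_group_labels(annotations):
    """Aggregate separately within each annotator group."""
    by_group = defaultdict(list)
    for group, label in annotations:
        by_group[group].append(label)
    return {group: majority(labels) for group, labels in by_group.items()}


print(pooled_label(annotations))      # 1
print(per_group_labels(annotations))  # {'ingroup': 0, 'outgroup': 1}
```

In this toy case the pooled label marks the post as toxic even though every hypothetical in-group annotator judged it non-toxic, which is the kind of signal a single ground truth can erase for reclaimed language.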
