ModelCitizens: Representing Community Voices in Online Safety

Abstract

Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task: perceptions of toxic language are shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the effect of conversational context on toxicity, which is typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g., OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models fine-tuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models, and code are available at https://github.com/asuvarna31/modelcitizens.
