Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA
May 9, 2025
Authors: Karthik Reddy Kanjula, Surya Guthikonda, Nahid Alam, Shayekh Bin Islam
cs.AI
Abstract
Pretraining datasets are foundational to the development of multimodal
models, yet they often inherit biases and toxic content from the
web-scale corpora they are sourced from. In this paper, we investigate the
prevalence of toxicity in the LLaVA image-text pretraining dataset, examining how
harmful content manifests in different modalities. We present a comprehensive
analysis of common toxicity categories and propose targeted mitigation
strategies, resulting in the creation of a refined toxicity-mitigated dataset.
This dataset removes 7,531 toxic image-text pairs from the LLaVA pretraining
dataset. We offer guidelines for implementing robust toxicity detection
pipelines. Our findings underscore the need to actively identify and filter
toxic content, such as hate speech, explicit imagery, and targeted harassment,
to build more responsible and equitable multimodal systems. The
toxicity-mitigated dataset is open source and is available for further
research.