Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA

May 9, 2025
Authors: Karthik Reddy Kanjula, Surya Guthikonda, Nahid Alam, Shayekh Bin Islam
cs.AI

Abstract

Pretraining datasets are foundational to the development of multimodal models, yet they often carry inherent biases and toxic content from the web-scale corpora they are sourced from. In this paper, we investigate the prevalence of toxicity in the LLaVA image-text pretraining dataset, examining how harmful content manifests in different modalities. We present a comprehensive analysis of common toxicity categories and propose targeted mitigation strategies, resulting in a refined, toxicity-mitigated dataset that removes 7,531 toxic image-text pairs from the LLaVA pretraining dataset. We also offer guidelines for implementing robust toxicity detection pipelines. Our findings underscore the need to actively identify and filter toxic content - such as hate speech, explicit imagery, and targeted harassment - to build more responsible and equitable multimodal systems. The toxicity-mitigated dataset is open source and is available for further research.
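
The abstract refers to guidelines for a toxicity detection pipeline but does not spell one out here; the sketch below is an illustrative filtering pass over LLaVA-style pretraining records, not the authors' method. It assumes the LCS-558K JSON layout (a "conversations" list whose "gpt" turn holds the caption), uses the open-source Detoxify text classifier, applies an arbitrary 0.5 threshold, and leaves the image-side (explicit-imagery) check as a stub.

```python
# Minimal sketch of a text-side toxicity filter over LLaVA-style pretraining
# records. Field names, the Detoxify classifier, and the 0.5 threshold are
# illustrative assumptions, not the paper's actual pipeline.
import json
from detoxify import Detoxify

TOXICITY_THRESHOLD = 0.5  # assumed cutoff; tune per toxicity category in practice


def caption_of(record: dict) -> str:
    """Extract the caption text from a record, assuming the LCS-558K layout
    where the 'gpt' conversation turn carries the caption."""
    for turn in record.get("conversations", []):
        if turn.get("from") == "gpt":
            return turn.get("value", "")
    return ""


def is_toxic_text(scores: dict) -> bool:
    """Flag a caption if any Detoxify category score crosses the threshold."""
    return any(score >= TOXICITY_THRESHOLD for score in scores.values())


def filter_dataset(in_path: str, out_path: str) -> None:
    with open(in_path) as f:
        records = json.load(f)

    detector = Detoxify("original")  # text-only toxicity classifier
    kept, dropped = [], 0
    for rec in records:
        text = caption_of(rec)
        scores = detector.predict(text) if text else {}
        # An image-side check (e.g. an NSFW / explicit-content classifier on
        # rec["image"]) would also go here; omitted to keep the sketch text-only.
        if is_toxic_text(scores):
            dropped += 1
            continue
        kept.append(rec)

    with open(out_path, "w") as f:
        json.dump(kept, f)
    print(f"kept {len(kept)} records, removed {dropped} flagged pairs")


if __name__ == "__main__":
    # Hypothetical file names for the original and filtered dataset.
    filter_dataset("blip_laion_cc_sbu_558k.json", "llava_pretrain_detoxified.json")
```

In practice, a production pipeline would combine per-category thresholds for text with a separate image classifier and human review of borderline pairs, rather than a single scalar cutoff.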
