
<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs

September 10, 2025
Authors: Sergey Pletenev, Daniil Moskovskiy, Alexander Panchenko
cs.AI

Abstract

Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
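The abstract attributes the performance drop to a lexical diversity gap: synthetic toxic text reuses a small set of insults, while human toxicity is more varied. Below is a minimal sketch, not the paper's evaluation code, of one common way such a gap could be quantified, using the distinct-n ratio (unique n-grams divided by total n-grams). The example corpora are hypothetical placeholders, not data from ParaDetox or SST-2.

```python
# Minimal sketch: compare lexical diversity of human-written vs. LLM-generated
# toxic text with distinct-n (unique n-grams / total n-grams). Higher = more diverse.
from collections import Counter

def distinct_n(texts: list[str], n: int = 1) -> float:
    """Fraction of unique n-grams across a corpus of texts."""
    ngrams = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

# Hypothetical toy corpora for illustration only (not from the paper's datasets).
human_toxic = ["you are a complete idiot", "what a useless piece of junk"]
synthetic_toxic = ["you are an idiot", "what an idiot you are"]

print("distinct-1 human:    ", distinct_n(human_toxic, 1))
print("distinct-1 synthetic:", distinct_n(synthetic_toxic, 1))
```

A lower distinct-n on the synthetic side would be consistent with the repetitive-insult vocabulary the paper describes; the actual study uses its own joint detoxification metrics rather than this particular measure.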