<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs

Abstract

Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has received little attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training detoxification models. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from the ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a performance drop of up to 30% in joint metrics. We identify the root cause as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
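The lexical diversity gap described in the abstract can be illustrated with a small sketch. This is not the paper's measurement procedure, which the abstract does not specify; it assumes a simple type-token ratio over words from a hypothetical insult lexicon. The function names, the lexicon, and the toy sentences are all made up for illustration and deliberately use mild placeholder words.

```python
from collections import Counter
import re


def insult_counts(texts, lexicon):
    """Count occurrences of each lexicon word across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple alphabetic tokens.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in lexicon:
                counts[token] += 1
    return counts


def diversity_report(texts, lexicon):
    """Summarize how varied the insult vocabulary is (assumed metric)."""
    counts = insult_counts(texts, lexicon)
    total = sum(counts.values())      # all insult tokens found
    distinct = len(counts)            # unique insult words found
    ratio = distinct / total if total else 0.0
    return {"total": total, "distinct": distinct, "type_token_ratio": ratio}


# Hypothetical lexicon and toy corpora, not data from the paper.
LEXICON = {"idiot", "stupid", "fool", "moron", "jerk"}
human_like = [
    "what an idiot",
    "stop being a jerk",
    "only a fool says that",
    "that moron again",
]
synthetic_like = [
    "what an idiot",
    "you idiot",
    "stupid idea",
    "such an idiot",
]

human_report = diversity_report(human_like, LEXICON)
synthetic_report = diversity_report(synthetic_like, LEXICON)
print(human_report)
print(synthetic_report)
```

In this toy setup both corpora contain the same number of insult tokens, but the synthetic-style corpus reuses one word for most of them, so its type-token ratio is lower. A real comparison would need a much larger corpus and a properly curated lexicon.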
