Preference Tuning For Toxicity Mitigation Generalizes Across Languages

June 23, 2024
Authors: Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
cs.AI

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
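
The method at the center of the abstract is DPO trained only on English preference pairs, where the preferred continuation is non-toxic and the rejected one is toxic. As a minimal sketch of the standard DPO objective, and not the authors' exact implementation, the loss can be written as below; the function name `dpo_loss` and the value `beta=0.1` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on a batch of English preference pairs.
    Each input is the summed log-probability of a full continuation under
    the trainable policy or the frozen reference model, shape (batch,).
    The 'chosen' continuation is the non-toxic one, 'rejected' is toxic."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log(sigmoid(x)) == softplus(-x): numerically stable form of the loss
    return F.softplus(-beta * (chosen_margin - rejected_margin)).mean()
```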

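The final claim, that bilingual sentence retrieval predicts cross-lingual transferability of the English-only DPO training, can be pictured with the sketch below: embed parallel English/target-language sentences and check how often nearest-neighbor cosine retrieval recovers the correct translation. The helper name `retrieval_accuracy` and the choice of embedding source (e.g. which layer's hidden states to pool) are assumptions for illustration, not details stated in the abstract.

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(src_embs: torch.Tensor, tgt_embs: torch.Tensor) -> float:
    """Bilingual sentence retrieval probe (hypothetical helper): src_embs and
    tgt_embs hold sentence embeddings of parallel English / target-language
    sentences, shape (n, d), where row i of each matrix is a translation pair.
    Returns the fraction of English sentences whose nearest target-language
    neighbour (by cosine similarity) is their own translation."""
    src = F.normalize(src_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    sims = src @ tgt.T                     # (n, n) cosine similarity matrix
    predictions = sims.argmax(dim=-1)      # nearest target index per source
    gold = torch.arange(src_embs.size(0))
    return (predictions == gold).float().mean().item()
```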