

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

June 23, 2024
Authors: Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
cs.AI

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
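To make the preference-tuning objective concrete, below is a minimal sketch of the DPO loss the abstract refers to, written in plain PyTorch. This is illustrative only: the paper's actual training data, models, and hyperparameters are not reproduced here, and the function names and the beta value are assumptions for the example.

```python
# Minimal sketch of the DPO objective used for English-only detoxification.
# Illustrative only; not the paper's actual training setup.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a non-toxic ("chosen")
    or toxic ("rejected") continuation under the trainable policy or the
    frozen reference model. beta scales the implicit KL-style penalty.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Maximize the chosen-vs-rejected margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```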
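The abstract also notes that bilingual sentence retrieval predicts how well English-only DPO transfers to a target language. A simple way to operationalize such a probe is top-1 retrieval accuracy over parallel sentence pairs; the sketch below assumes precomputed sentence embeddings and cosine similarity, whereas the paper's exact choice of representation layer and pooling is not specified here and is an assumption.

```python
# Sketch of bilingual sentence retrieval accuracy as a transferability probe.
# Embeddings here are random placeholders; in practice they would come from
# the LLM's own hidden states for parallel English/target-language sentences.
import torch
import torch.nn.functional as F

def retrieval_accuracy(en_embs: torch.Tensor, tgt_embs: torch.Tensor) -> float:
    """Top-1 accuracy of retrieving the aligned target-language sentence for
    each English sentence by cosine similarity (row i of en_embs is parallel
    to row i of tgt_embs)."""
    en = F.normalize(en_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    sims = en @ tgt.T                       # (N, N) cosine similarity matrix
    preds = sims.argmax(dim=-1)             # nearest target sentence per English sentence
    return (preds == torch.arange(len(en))).float().mean().item()

# Toy usage with random 32-dim embeddings for 100 parallel sentence pairs.
torch.manual_seed(0)
en, tgt = torch.randn(100, 32), torch.randn(100, 32)
print(retrieval_accuracy(en, tgt))  # near chance (~0.01) for random vectors
```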

