Preference Tuning For Toxicity Mitigation Generalizes Across Languages

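Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial as their global use increases. In this work, we explore zero-shot cross-lingual generalization of preference tuning for detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools such as causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.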
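For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective, which rewards the policy for raising the likelihood of a preferred (here, non-toxic) continuation relative to a frozen reference model. The abstract does not state the loss or its hyperparameters; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    "chosen" is the preferred (non-toxic) continuation and
    "rejected" is the dispreferred (toxic) one.
    """
    # Log-ratio of policy to reference for each continuation.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors
    # the chosen continuation more than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference sits at log(2); shifting
# probability toward the chosen continuation lowers the loss.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -20.0, -10.0, -10.0)
print(baseline, improved)
```

In practice the log-probabilities would come from summing token log-likelihoods of each continuation under the trained and reference models, and the loss would be averaged over a batch of English preference pairs.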

