Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial as their global use grows. In this work, we explore the zero-shot cross-lingual generalization of preference tuning for detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools such as causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
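
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the authors' training code. The function and argument names are hypothetical, the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model, and the value `beta=0.1` is an illustrative default rather than a setting reported in the abstract. In a detoxification setup, the "chosen" continuation would presumably be the less toxic one and the "rejected" continuation the more toxic one.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    All names here are illustrative assumptions. Each argument is the
    log-probability of a full continuation given the same prompt.
    """
    # How much more the policy likes each continuation than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin; larger means the policy favors "chosen" more.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy is identical to the reference model, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen continuation and away from the rejected one, relative to the reference.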
