DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
February 7, 2025
Authors: Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has increased the need
for guardrail models to ensure responsible use, particularly in detecting
unsafe and illegal content. While substantial safety data exist in English,
multilingual guardrail modeling remains underexplored due to the scarcity of
open-source safety data in other languages. To address this gap, we propose a
novel two-player Reinforcement Learning (RL) framework, where a generator and a
guardrail model co-evolve adversarially to produce high-quality synthetic data
for multilingual guardrail training. We theoretically formalize this
interaction as a two-player game, proving convergence to a Nash equilibrium.
Empirical evaluations show that our model DuoGuard outperforms state-of-the-art
models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English
benchmarks while being 4.5x faster at inference with a significantly smaller
model (0.5B). We achieve substantial advancements in multilingual safety tasks,
particularly in addressing the imbalance for lower-resource languages in a
collected real dataset. Ablation studies emphasize the critical role of
synthetic data generation in bridging the imbalance in open-source data between
English and other languages. These findings establish a scalable and efficient
approach to synthetic data generation, paving the way for improved multilingual
guardrail models to enhance LLM safety. Code, model, and data will be
open-sourced at https://github.com/yihedeng9/DuoGuard.