DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

Abstract

The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model, DuoGuard, outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.