TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
January 26, 2026
Authors: Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
cs.AI
Abstract
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker that generates adversarial prompts, a defender that enforces safety safeguards, and an evaluator that judges responses. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative, co-improving collaboration among the three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
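To make the closed-loop interaction concrete, the toy sketch below simulates one possible shape of the tri-role self-play loop: in each round the attacker emits adversarial prompts, the defender responds, the evaluator labels the exchange, and all three roles receive rewards and update. The `Role` class, the scalar `skill` proxy, and the reward rules are illustrative assumptions for exposition only, not the paper's actual models or training objective.

```python
# Minimal sketch of a tri-role self-play loop (assumed structure, not the
# paper's implementation). Role, skill, and the reward rules are placeholders.
import random


class Role:
    """Stand-in for a policy model; update() abstracts an RL policy update step."""

    def __init__(self, name: str):
        self.name = name
        self.skill = 0.0  # scalar proxy for learned capability

    def update(self, reward: float, lr: float = 0.1) -> None:
        self.skill += lr * reward


def run_iteration(attacker: Role, defender: Role, evaluator: Role, num_prompts: int = 8) -> None:
    """One closed-loop round: attack -> defend -> evaluate -> reward all three roles."""
    for _ in range(num_prompts):
        # Attacker emits an adversarial prompt; its strength grows with skill.
        attack_strength = random.random() + attacker.skill
        # Defender attempts a safe, useful response.
        defense_strength = random.random() + defender.skill
        unsafe = attack_strength > defense_strength

        # Evaluator assigns a fine-grained label (unsafe / simple refusal / useful guidance).
        label = "unsafe" if unsafe else random.choice(["refusal", "guidance"])

        # Assumed reward shaping: attacker is rewarded for successful attacks,
        # defender for safe and useful responses. The evaluator gets a constant
        # placeholder reward here; a real signal would need consistency checks.
        attacker.update(+1.0 if unsafe else -0.5)
        defender.update(-1.0 if unsafe else (+1.0 if label == "guidance" else +0.2))
        evaluator.update(+0.5)


if __name__ == "__main__":
    roles = [Role("attacker"), Role("defender"), Role("evaluator")]
    for it in range(5):  # iterative co-evolution with no human labels in the loop
        run_iteration(*roles)
        print(it, [round(r.skill, 2) for r in roles])
```

In the actual framework each role would be an LLM trained with reinforcement learning from these interaction-derived rewards; the numeric simulation above only illustrates how the three reward signals close the loop.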