TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
January 26, 2026
Authors: Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
cs.AI
Abstract
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker that generates adversarial prompts, a defender that produces safety-aligned responses, and an evaluator that assesses those responses. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative, co-improving collaboration among the three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
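To make the tri-role closed loop concrete, the following is a minimal Python sketch of one self-play iteration as described above: the attacker generates adversarial prompts, the defender responds, and the evaluator labels each response as unsafe, a simple refusal, or useful guidance. All class and function names (Attacker, Defender, Evaluator, self_play_iteration) are illustrative assumptions; the three roles are stubbed with placeholders rather than trained LLM policies, and the reinforcement-learning update performed on the collected interactions is omitted, since the abstract does not specify it.

```python
# Minimal sketch of one tri-role self-play iteration (attack -> defend -> judge).
# Names and placeholder behaviors are assumptions for illustration only; the
# paper's reward design and policy-update steps are not reproduced here.

from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    prompt: str     # adversarial prompt produced by the attacker
    response: str   # defender's reply to that prompt
    verdict: str    # evaluator label: "unsafe", "refusal", or "guidance"


class Attacker:
    def generate_prompts(self, n: int) -> List[str]:
        # Placeholder: a real attacker would be an LLM policy trained to elicit
        # unsafe defender responses while maintaining output diversity.
        return [f"adversarial prompt #{i}" for i in range(n)]


class Defender:
    def respond(self, prompt: str) -> str:
        # Placeholder: a real defender is the LLM being safety-aligned.
        return f"safe, helpful guidance for: {prompt}"


class Evaluator:
    def judge(self, prompt: str, response: str) -> str:
        # Placeholder: a real evaluator distinguishes unsafe responses,
        # simple refusals, and useful guidance (fine-grained judgment).
        return "guidance"


def self_play_iteration(attacker: Attacker,
                        defender: Defender,
                        evaluator: Evaluator,
                        batch_size: int = 4) -> List[Interaction]:
    """Run one closed-loop round and return labeled interactions.

    In the full framework, these labeled rollouts would provide the reward
    signal used to update all three roles (update step omitted here).
    """
    batch = []
    for prompt in attacker.generate_prompts(batch_size):
        response = defender.respond(prompt)
        verdict = evaluator.judge(prompt, response)
        batch.append(Interaction(prompt, response, verdict))
    return batch


if __name__ == "__main__":
    rollouts = self_play_iteration(Attacker(), Defender(), Evaluator())
    for r in rollouts:
        print(r.verdict, "|", r.prompt)
```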