AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
October 2, 2025
Authors: Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
cs.AI
Abstract
LLM-based multi-agent systems excel at planning, tool use, and role
coordination, but their openness and interaction complexity also expose them to
jailbreaks, prompt injection, and adversarial collaboration. Existing defenses
fall into two categories: (i) self-verification, which asks each agent to pre-filter
unsafe instructions before execution, and (ii) external guard modules that
police behaviors. The former often underperforms because a standalone agent
lacks sufficient capacity to detect cross-agent unsafe chains and
delegation-induced risks; the latter increases system overhead and creates a
single point of failure: once compromised, system-wide safety collapses, and
adding more guards compounds cost and complexity. To address these challenges, we
propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning
framework that internalizes safety into task agents. Rather than relying on
external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize
evolving jailbreak prompts) and defenders (task agents trained to both
accomplish their duties and resist attacks) in adversarial learning
environments. To stabilize learning and foster cooperation, we introduce a
public baseline for advantage estimation: agents within the same functional
group share a group-level mean-return baseline, enabling lower-variance updates
and stronger intra-group coordination. Across representative attack scenarios,
AdvEvo-MARL consistently keeps the attack success rate (ASR) below 20%, whereas
baselines reach up to 38.33%, while preserving, and sometimes improving, task
accuracy (up to +3.67% on reasoning tasks). These results show that safety and
utility can be jointly improved without relying on extra guard agents or added
system overhead.
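
The co-evolutionary training described above can be illustrated with a toy
loop in which an attacker policy and a defender (task) agent improve against
each other. What follows is a minimal, self-contained sketch; the StubPolicy
class, its scalar skill stand-in for learned parameters, and the exact reward
shaping are illustrative assumptions, not the paper's implementation.

```python
import random

class StubPolicy:
    """Stands in for an LLM policy; update() mimics an RL policy step."""
    def __init__(self, name: str):
        self.name = name
        self.skill = 0.5  # toy surrogate for learned parameters

    def update(self, reward: float, lr: float = 0.05) -> None:
        # Crude surrogate for a policy-gradient update: positive reward
        # nudges the policy's competence upward, negative reward down.
        self.skill = min(1.0, max(0.0, self.skill + lr * reward))

def run_episode(attacker: StubPolicy, defender: StubPolicy):
    # The attack succeeds more often when the attacker out-skills the
    # defender; the task is solved more often as the defender improves.
    attacked = random.random() < attacker.skill * (1 - defender.skill)
    solved = random.random() < defender.skill
    return attacked, solved

attacker, defender = StubPolicy("attacker"), StubPolicy("defender")
for _ in range(1000):
    attacked, solved = run_episode(attacker, defender)
    # The attacker is rewarded for successful jailbreaks; the defender
    # is rewarded for solving the task and penalized when compromised,
    # which internalizes safety into the task agent itself.
    attacker.update(1.0 if attacked else -1.0)
    defender.update((1.0 if solved else 0.0) - (1.0 if attacked else 0.0))
```

The point of the loop is the pressure it creates: as the attacker's prompts
evolve, the defender must keep improving its resistance without sacrificing
task reward, rather than delegating safety to an external guard.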
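
The public baseline for advantage estimation is likewise concrete enough to
sketch: each agent's advantage is its episode return minus the mean return of
its functional group, so all agents in a group subtract the same baseline. In
the sketch below the Rollout container and the group names are hypothetical;
only the group-mean subtraction reflects the mechanism the abstract describes.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rollout:
    group: str   # functional group, e.g. "planner" or "executor"
    ret: float   # episode return for one agent

def shared_baseline_advantages(rollouts: list[Rollout]) -> list[float]:
    """Advantage = agent return minus its group's mean return."""
    returns_by_group: dict[str, list[float]] = defaultdict(list)
    for r in rollouts:
        returns_by_group[r.group].append(r.ret)
    # One shared baseline per functional group.
    baseline = {g: sum(rs) / len(rs) for g, rs in returns_by_group.items()}
    return [r.ret - baseline[r.group] for r in rollouts]

# Example batch: two planners and two executors.
batch = [Rollout("planner", 1.0), Rollout("planner", 0.2),
         Rollout("executor", 0.8), Rollout("executor", 0.4)]
print(shared_baseline_advantages(batch))  # [0.4, -0.4, 0.2, -0.2]
```

Because the shared baseline does not depend on any single agent's action,
subtracting it leaves the policy gradient unbiased while reducing its
variance, and tying each agent's credit to the group mean encourages the
intra-group coordination the abstract reports.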