

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

October 2, 2025
作者: Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
cs.AI

Abstract

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreaks, prompt injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification, which asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To address these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps the attack success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving, and sometimes improving, task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
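The public baseline described above admits a direct reading: each agent's advantage is its return minus the mean return of its functional group. The abstract does not give the exact estimator, so the following is a minimal sketch under that assumption; the function name, array shapes, and example numbers are illustrative, not from the paper.

```python
import numpy as np

def group_baseline_advantages(returns, group_ids):
    """Advantage estimation with a shared group-level mean-return baseline.

    returns   : shape (num_agents,), per-agent episode returns
    group_ids : shape (num_agents,), functional-group label for each agent
                (e.g. 0 = defender/task agents, 1 = attacker agents)
    """
    returns = np.asarray(returns, dtype=float)
    group_ids = np.asarray(group_ids)
    advantages = np.empty_like(returns)
    for g in np.unique(group_ids):
        mask = group_ids == g
        # Subtract the group's mean return: agents in the same functional
        # group share one baseline, giving lower-variance updates than a
        # raw-return objective.
        advantages[mask] = returns[mask] - returns[mask].mean()
    return advantages

# Hypothetical episode: two defenders and two attackers.
returns = [1.0, 0.4, -0.2, 0.6]
group_ids = [0, 0, 1, 1]
print(group_baseline_advantages(returns, group_ids))
# -> [ 0.3 -0.3 -0.4  0.4]
```

Measuring each agent against its own group's average, rather than a global or per-agent baseline, keeps updates comparable within a group, which is one plausible mechanism behind the lower-variance, better-coordinated updates the abstract claims.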