AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Abstract

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreaks, prompt injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification, which asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules, which police agent behavior. The former often underperforms because a standalone agent lacks the capacity to detect unsafe chains that span multiple agents and risks induced by delegation. The latter increases system overhead and creates a single point of failure: once a guard is compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To address these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into the task agents themselves. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained both to accomplish their duties and to resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, which enables lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps the attack success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving, and sometimes improving, task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be improved jointly without extra guard agents or added system overhead.
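The public baseline described above can be illustrated with a minimal sketch. The abstract states only that agents in the same functional group share a group-level mean-return baseline; everything else here is an assumption made for illustration, including the function name `group_baseline_advantages`, the use of a single scalar return per agent, the absence of any normalization, and the example agent names and values.

```python
from collections import defaultdict
from typing import Dict, List


def group_baseline_advantages(
    returns: Dict[str, float], groups: Dict[str, str]
) -> Dict[str, float]:
    """Advantage of each agent relative to the mean return of its group.

    Hypothetical helper: the paper's exact estimator is not given in the
    abstract, so this shows only the shared mean-return baseline idea.
    """
    # Collect the returns of all agents that belong to the same group.
    by_group: Dict[str, List[float]] = defaultdict(list)
    for agent, ret in returns.items():
        by_group[groups[agent]].append(ret)

    # The shared ("public") baseline is the mean return of each group.
    baseline = {g: sum(vals) / len(vals) for g, vals in by_group.items()}

    # Each agent's advantage is its own return minus its group's baseline.
    return {a: r - baseline[groups[a]] for a, r in returns.items()}


# Illustrative values only; these are not results from the paper.
returns = {"atk_0": 1.0, "atk_1": 0.0, "def_0": 0.5, "def_1": 1.0, "def_2": 0.0}
groups = {
    "atk_0": "attacker",
    "atk_1": "attacker",
    "def_0": "defender",
    "def_1": "defender",
    "def_2": "defender",
}
advantages = group_baseline_advantages(returns, groups)
```

Because the baseline is a group mean, the advantages within each group sum to zero, so an agent is credited only for doing better than its peers in the same role.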
