Effective Red-Teaming of Policy-Adherent Agents

Abstract

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for tailored design and evaluation methodologies that ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.