Effective Red-Teaming of Policy-Adherent Agents

Abstract

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for tailored design and evaluation methodologies that ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.