Effective Red-Teaming of Policy-Adherent Agents

Abstract

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.