Effective Red-Teaming of Policy-Adherent Agents

Abstract


Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
