DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon, high-stakes action execution. Because of their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent, which systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, and combinations thereof) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.
