DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract

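AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon, high-stakes action execution. Because they are highly capable and flexible, such agents raise significant security and safety concerns. A growing number of real-world incidents show that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, because agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale agent risk assessment in DTap, we further propose DTap-Red, the first autonomous red-teaming agent, which systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, and combinations of these) and autonomously discovers effective attack strategies tailored to different malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset of high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies. These evaluations reveal systematic vulnerability patterns and provide valuable insights for developing secure next-generation agents.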
