DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon, high-stakes actions. Because they are highly capable and flexible, such agents raise significant security and safety concerns. A growing number of real-world incidents show that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. Yet realistic, controllable, and reproducible environments for large-scale risk assessment remain largely unexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale agent risk assessment in DTap, we further propose DTap-Red, the first autonomous red-teaming agent, which systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, and combinations of these) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset of high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies. These evaluations reveal systematic vulnerability patterns and provide insights for developing secure next-generation agents.
