DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon, high-stakes actions. Because they are highly capable and flexible, such agents raise significant security and safety concerns. A growing number of real-world incidents show that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, because agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. Yet realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, and their combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset of high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies. These evaluations reveal systematic vulnerability patterns and provide valuable insights for developing secure next-generation agents.
