T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Abstract

While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.
