T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Abstract

Prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), but such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose T-MAP, a trajectory-aware evolutionary search method that uses execution trajectories to guide the discovery of adversarial prompts. Our approach automatically generates attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments show that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, revealing previously underexplored vulnerabilities in autonomous LLM agents.
