T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Abstract

While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.
