T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Abstract

Prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs). Such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose T-MAP, a trajectory-aware evolutionary search method that leverages execution trajectories to guide the discovery of adversarial prompts. Our approach automatically generates attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments show that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, revealing previously underexplored vulnerabilities in autonomous LLM agents.
