Servant, Stalker, Predator: How an Honest, Helpful, and Harmless (3H) Agent Unlocks Adversarial Skills

Abstract

This paper identifies and analyzes a novel vulnerability class in Model Context Protocol (MCP) based agent systems. We describe and demonstrate an attack chain in which benign, individually authorized tasks are orchestrated to produce harmful emergent behaviors. Through systematic analysis using the MITRE ATLAS framework, we demonstrate how 95 agents tested with access to multiple services (including browser automation, financial analysis, location tracking, and code deployment) can chain legitimate operations into sophisticated attack sequences that extend beyond the security boundaries of any individual service. These red team exercises examine whether current MCP architectures lack the cross-domain security measures necessary to detect or prevent a large category of compositional attacks. We present empirical evidence of specific attack chains that achieve targeted harm through service orchestration, including data exfiltration, financial manipulation, and infrastructure compromise. These findings reveal that the fundamental security assumption of service isolation fails when agents can coordinate actions across multiple domains, creating an attack surface that grows exponentially with each additional capability. This research provides a barebones experimental framework that evaluates not whether agents can complete MCP benchmark tasks, but what happens when they complete them too well and optimize across multiple services in ways that violate human expectations and safety constraints. We propose three concrete experimental directions using the existing MCP benchmark suite.
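
The compositional gap described above can be illustrated with a minimal sketch. It assumes a toy policy model and is not the paper's experimental framework or any real MCP API: the service names, action names, `ALLOWED_ACTIONS`, and `RISKY_COMBINATIONS` are all hypothetical. Each call passes its own service's allowlist, while only a check over the whole chain notices that the combination of services is sensitive. The helper `ordered_chain_count` gives one simple way to see how quickly the space of possible chains grows as tools are added, under the assumption that a chain is an ordered sequence of distinct tools.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    service: str  # hypothetical service name, not a real MCP server
    action: str   # hypothetical operation exposed by that service


# Assumption: each service authorizes only against its own allowlist.
ALLOWED_ACTIONS = {
    "browser": {"search_person"},
    "location": {"geocode_address"},
    "finance": {"lookup_records"},
}

# Assumption: a cross-domain rule naming service combinations that are
# sensitive when used together in one session.
RISKY_COMBINATIONS = [frozenset({"browser", "location", "finance"})]


def authorized_per_service(call: ToolCall) -> bool:
    """Per-service check: sees one call, knows nothing about the chain."""
    return call.action in ALLOWED_ACTIONS.get(call.service, set())


def chain_is_risky(chain: list[ToolCall]) -> bool:
    """Cross-domain check: looks at the set of services used by the chain."""
    used = {call.service for call in chain}
    return any(combo <= used for combo in RISKY_COMBINATIONS)


def ordered_chain_count(n_tools: int, max_len: int) -> int:
    """Count ordered chains of 2..max_len distinct tools out of n_tools."""
    return sum(math.perm(n_tools, k) for k in range(2, max_len + 1))


chain = [
    ToolCall("browser", "search_person"),
    ToolCall("location", "geocode_address"),
    ToolCall("finance", "lookup_records"),
]

every_call_allowed = all(authorized_per_service(c) for c in chain)
flagged_as_chain = chain_is_risky(chain)
```

In this toy setup every call is individually allowed, yet the chain as a whole is flagged. That is the kind of signal a per-service boundary cannot produce on its own.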