Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
August 27, 2025
Author: David Noever
cs.AI
Abstract
This paper identifies and analyzes a novel class of vulnerability in agent systems based on the Model Context Protocol (MCP). Through described and demonstrated attack chains, we show how benign, individually authorized tasks can be orchestrated to produce harmful emergent behaviors. Through systematic analysis using the MITRE ATLAS framework, we demonstrate how the 95 tested agents with access to multiple services (including browser automation, financial analysis, location tracking, and code deployment) can chain legitimate operations into sophisticated attack sequences that extend beyond the security boundaries of any individual service. These red team exercises investigate whether current MCP architectures lack the cross-domain security measures needed to detect or prevent a large category of compositional attacks. We present empirical evidence of specific attack chains that achieve targeted harm through service orchestration, including data exfiltration, financial manipulation, and infrastructure compromise. These findings reveal that the fundamental security assumption of service isolation fails when agents can coordinate actions across multiple domains, creating an attack surface that grows exponentially with each additional capability. This research provides a barebones experimental framework that evaluates not whether agents can complete MCP benchmark tasks, but what happens when they complete them too well and optimize across multiple services in ways that violate human expectations and safety constraints. We propose three concrete experimental directions using the existing MCP benchmark suite.
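The claim that the attack surface grows exponentially with each added capability can be made concrete with a simple count. The sketch below is a minimal illustration, not the paper's method: it assumes the attack surface is modeled as the set of distinct combinations of two or more services an agent could compose, ignoring call order and repetition (which would only make the count larger). The service labels are hypothetical names for the four services listed above. Per-service isolation checks scale linearly with the number of services n, while the combinations a cross-domain policy would have to reason about scale as 2^n - n - 1.

```python
from itertools import combinations

# Hypothetical labels for the four service types named in the abstract.
SERVICES = [
    "browser_automation",
    "financial_analysis",
    "location_tracking",
    "code_deployment",
]


def multi_service_combinations(services):
    """Enumerate every unordered subset of two or more services an agent could compose."""
    return [
        combo
        for k in range(2, len(services) + 1)
        for combo in combinations(services, k)
    ]


def count_multi_service_combinations(n):
    """Closed form: all 2**n subsets minus the empty set and the n single-service subsets."""
    return 2 ** n - n - 1


if __name__ == "__main__":
    print(len(multi_service_combinations(SERVICES)))  # 11
    # Compare linear per-service checks with the number of cross-service compositions.
    for n in range(1, 9):
        print(f"services={n} per_service_checks={n} "
              f"compositions={count_multi_service_combinations(n)}")
```

Under this model, the four named services already yield 11 multi-service combinations, and each additional service roughly doubles the count, whereas a defense that only checks each service in isolation grows by one check per service.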