SkillHarm: Lifecycle-Aware Skill Attacks via Automated Construction

Abstract

Agent skills occupy a privileged position in the agent workflow: agents are expected to follow and execute them implicitly, which makes third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types, organized by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline in which coding agents are driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% under FPP and 69.3% under SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.