SkillHarm：透過自動化建構之生命週期感知型技能攻擊

摘要

技能在智能体工作流中占据特殊地位，因为智能体需隐式遵循并执行这些技能，这使得第三方技能成为易受攻击的薄弱环节。现有研究虽已揭示基于技能的攻击所引发的智能体不安全行为，但主要针对单次任务执行中受污染技能进行评估，并通过临时风险清单枚举危害类型。为填补这些空白，我们提出SkillHarm——一项覆盖技能使用全生命周期的基准测试，并配以系统化的技能相关风险分类体系。SkillHarm评估两种攻击场景：固定载荷投毒（FPP），即固定受污染技能包会直接危害任何调用它的任务会话；自变异投毒（SMP），即初始良性的执行会悄然改变持久的技能内容，将危害延迟到后续复用。基于风险所针对的智能体工作流组件，它进一步定义了12种风险类型：数据管道、系统环境和智能体自主性。为规模化实例化这些攻击，我们构建了AutoSkillHarm——一种通过自然语言驱动编码智能体的自动化构建流水线。最终基准测试包含71项技能的879个攻击样本。实验表明，当前智能体仍存在脆弱性，FPP攻击成功率达86.3%，SMP达69.3%。我们的分析进一步揭示了一个潜在风险：许多明显的攻击失败源于智能体未能接触受污染文件，而非真正的抵抗能力，且当前防御措施仍无法可靠缓解这一威胁。

English

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.