

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

June 1, 2026
Authors: Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou, Junyi Li, Weitong Ruan, Chentao Ye, Rahul Gupta, Diyi Yang, Yu Su, Huan Sun
cs.AI

Abstract

Agent skills occupy a privileged position in the agent workflow: agents are expected to follow and execute them implicitly, which makes third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types, grouped by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.
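The difference between the two scenarios can be illustrated with a toy model. The sketch below is not the SkillHarm benchmark or the AutoSkillHarm pipeline. It only assumes that skills are text persisted across sessions, and every name in it (SkillStore, run_session, PAYLOAD, the "report" skill) is hypothetical. In FPP the payload is present from the first session onward, while in SMP the first session looks clean and the harm only appears on reuse.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder marker standing in for poisoned content (hypothetical, inert).
PAYLOAD = "<malicious instruction>"


@dataclass
class SkillStore:
    """Toy persistent store: skill name -> skill text shared across sessions."""
    skills: dict[str, str] = field(default_factory=dict)


def run_session(store: SkillStore, skill: str, mutate: bool = False) -> bool:
    """Simulate one task session that invokes `skill`.

    Returns True if the session consumed poisoned content.
    If `mutate` is True, the session appends the payload to the persistent
    skill text after using it, so only later sessions are affected.
    """
    content = store.skills[skill]
    harmed = PAYLOAD in content
    if mutate:
        store.skills[skill] = content + "\n" + PAYLOAD
    return harmed


def fpp() -> list[bool]:
    """Fixed-Payload Poisoning: the skill package is poisoned from the start."""
    store = SkillStore({"report": "benign steps\n" + PAYLOAD})
    return [run_session(store, "report") for _ in range(2)]


def smp() -> list[bool]:
    """Self-Mutating Poisoning: session 1 is benign but rewrites the skill."""
    store = SkillStore({"report": "benign steps"})
    first = run_session(store, "report", mutate=True)
    second = run_session(store, "report")
    return [first, second]


if __name__ == "__main__":
    print("FPP sessions harmed:", fpp())
    print("SMP sessions harmed:", smp())
```

In this toy model, an evaluation that inspects only a single task execution would record the SMP case as harmless, which is the gap that a lifecycle-aware evaluation is meant to close.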