SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Abstract

Agent skills occupy a privileged position in the agent workflow: agents are expected to follow and execute them implicitly, which makes third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks spanning the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), in which a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), in which an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types, organized by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline in which coding agents are driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.
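The distinction between genuine resistance and non-engagement can be made concrete with a small scoring sketch. The code below is an illustration only, not the benchmark's actual harness: the `Trial` record, its field names, and the toy trials are all hypothetical, and the numbers are made up rather than taken from SkillHarm. It computes an attack success rate and splits the failures into cases where the agent read the poisoned file and declined, and cases where it never opened the file at all.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One attack attempt. All fields are hypothetical, for illustration."""
    scenario: str        # "FPP" or "SMP"
    read_poisoned: bool  # did the agent open the poisoned skill file?
    harm_executed: bool  # did the harmful action actually occur?


def classify(trial: Trial) -> str:
    """Label a trial as a success, genuine resistance, or non-engagement."""
    if trial.harm_executed:
        return "success"
    # A failed attack only counts as resistance if the agent saw the payload.
    return "resisted" if trial.read_poisoned else "not_engaged"


def summarize(trials: list[Trial]) -> dict:
    """Compute the attack success rate and break down the failures."""
    counts = Counter(classify(t) for t in trials)
    total = len(trials)
    return {
        "asr": counts["success"] / total if total else 0.0,
        "resisted": counts["resisted"],
        "not_engaged": counts["not_engaged"],
    }


# Toy data, invented for this example (not benchmark results).
toy_trials = [
    Trial("FPP", read_poisoned=True, harm_executed=True),
    Trial("FPP", read_poisoned=True, harm_executed=False),
    Trial("FPP", read_poisoned=False, harm_executed=False),
    Trial("FPP", read_poisoned=False, harm_executed=False),
]
summary = summarize(toy_trials)
print(summary)
```

In this toy run, three of four attacks fail, but only one failure reflects the agent seeing the payload and refusing it. Reporting the two failure types separately keeps a low success rate from being read as robustness.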