SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Abstract

Agent skills occupy a privileged position in the agent workflow: agents are expected to follow and execute them implicitly, which makes third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types, grouped by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline that uses coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% in FPP and up to 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.
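The latent-risk observation can be made concrete by separating the overall attack success rate from the rate among runs in which the agent actually engaged with the poisoned file. The sketch below is only illustrative and is not the SkillHarm evaluation code: the `Outcome` record, its fields, both helper functions, and the toy data are assumptions introduced here, and the numbers it produces are unrelated to the reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-sample record (not the benchmark's actual schema)."""
    scenario: str    # "FPP" or "SMP"
    engaged: bool    # the agent read or executed the poisoned skill content
    succeeded: bool  # the harmful behavior was observed


def attack_success_rate(outcomes: List[Outcome]) -> float:
    """Fraction of all samples in which the attack succeeded."""
    if not outcomes:
        return 0.0
    return sum(o.succeeded for o in outcomes) / len(outcomes)


def engaged_success_rate(outcomes: List[Outcome]) -> float:
    """Success rate restricted to samples where the agent engaged."""
    return attack_success_rate([o for o in outcomes if o.engaged])


# Toy data, invented for illustration only.
toy = [
    Outcome("FPP", True, True),
    Outcome("FPP", True, True),
    Outcome("FPP", False, False),
    Outcome("FPP", False, False),
    Outcome("SMP", True, True),
    Outcome("SMP", True, False),
    Outcome("SMP", False, False),
    Outcome("SMP", False, False),
]

for name in ("FPP", "SMP"):
    subset = [o for o in toy if o.scenario == name]
    print(name, attack_success_rate(subset), engaged_success_rate(subset))
```

In this toy data, half of the FPP runs count as attack failures only because the agent never touched the poisoned file. Conditioning on engagement raises the rate, which is the sense in which a low overall rate can overstate genuine resistance.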