프롬프트 인젝션에서 지속적 제어로: 트로이 백도어로부터 에이전트 하네스 방어하기

초록

LLM 에이전트는 대화형 챗봇에서 실제 작업 공간에서의 운영 도구로 진화하고 있다. 로컬 에이전트 하네스에서 LLM은 파일 읽기 및 쓰기, 도구 호출, 세션 간 작업 공간 상태 재사용이 가능하다. 이러한 기능은 유용성을 향상시키지만, 공격자에게 새로운 공격 표면을 노출한다. 공격자는 파일이나 도구 출력 내에 프롬프트 인젝션을 삽입할 수 있다. 에이전트는 이 숨겨진 명령을 읽고 저장한 후 나중에 실행할 수 있다. 이러한 다단계 트로이 목마 공격 패러다임에서는 개별 단계 자체는 악의적으로 보이지 않지만, 이러한 단계들이 결합되어 신뢰할 수 없는 텍스트를 지속적인 제어 콘텐츠로 전환할 수 있다. 그러나 기존 방어 메커니즘은 종종 각 단계를 개별적으로 검사한다. 결과적으로 명백한 유해 행위는 차단할 수 있지만, 백도어를 심는 초기 쓰기 작업은 탐지하지 못한다. 이러한 위협을 드러내기 위해, 우리는 로컬 에이전트 하네스에서 다단계 트로이 목마 공격을 식별하도록 설계된 벤치마크인 ClawTrojan을 소개한다. GPT-5.4를 사용한 OpenClaw 스타일의 시뮬레이션 작업 공간에서 ClawTrojan은 95.5%의 공격 성공률(ASR)을 달성하는 반면, 기존 단일 턴 프롬프트 인젝션 공격은 동일한 모델에서 거의 0에 가까운 ASR을 보인다. 이러한 위협에 대응하기 위해, 우리는 DASGuard를 제안한다. 이는 민감한 로컬 파일에서 제어와 유사한 텍스트를 스캔하고, 그 출처를 추적하며, 신뢰할 수 있는 출처에서 유래하지 않은 제어 콘텐츠를 제거한다. 우리의 결과는 DASGuard가 런타임 공격 차단과 작업 공간에 대한 정화된 커밋을 결합하여 강력한 동적 방어를 달성함을 보여준다.

English

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.