AgentWatcher: 규칙 기반 프롬프트 인젝션 모니터

초록

대규모 언어 모델(LLM)과 에이전트와 같은 그 응용 프로그램은 프롬프트 인젝션 공격에 매우 취약합니다. 최첨단 프롬프트 인젝션 탐지 방법은 다음과 같은 한계점을 가지고 있습니다: (1) 컨텍스트 길이가 증가함에 따라 그 효과성이 크게 저하되며, (2) 무엇이 프롬프트 인젝션을 구성하는지 정의하는 명시적인 규칙이 부족하여 탐지 결정이 암묵적이고 불투명하며 추론하기 어렵습니다. 본 연구에서는 위의 두 가지 한계를 해결하기 위해 AgentWatcher를 제안합니다. 첫 번째 한계를 해결하기 위해 AgentWatcher는 LLM의 출력(예: 에이전트의 행동)을 소수의 인과적으로 영향력 있는 컨텍스트 세그먼트에 귀인시킵니다. 상대적으로 짧은 텍스트에 탐지를 집중함으로써, AgentWatcher는 긴 컨텍스트에도 확장 가능할 수 있습니다. 두 번째 한계를 해결하기 위해, 우리는 무엇이 프롬프트 인젝션을 구성하는지와 그렇지 않은지를 명시하는 일련의 규칙을 정의하고, 모니터 LLM을 사용하여 귀인된 텍스트를 바탕으로 이러한 규칙을 추론하게 하여 탐지 결정을 더 설명 가능하게 만듭니다. 우리는 도구 사용 에이전트 벤치마크와 장문 컨텍스트 이해 데이터셋에 대해 포괄적인 평가를 수행합니다. 실험 결과는 AgentWatcher가 프롬프트 인젝션을 효과적으로 탐지하고 공격이 없을 때는 유틸리티를 유지할 수 있음을 보여줍니다. 코드는 https://github.com/wang-yanting/AgentWatcher 에서 확인할 수 있습니다.

English

Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.

AgentWatcher: 규칙 기반 프롬프트 인젝션 모니터

AgentWatcher: A Rule-based Prompt Injection Monitor

초록

Support