AgentWatcher：基于规则的提示注入监控系统

摘要

大型语言模型（LLMs）及其应用（如智能体）极易受到提示注入攻击。当前最先进的提示注入检测方法存在以下局限性：（1）随着上下文长度的增加，其检测效果显著下降；（2）缺乏明确定义提示注入行为的规则，导致检测决策隐含、不透明且难以追溯。本研究提出AgentWatcher以解决上述两个问题。针对第一个局限性，AgentWatcher将LLM的输出（例如智能体的动作）归因于少量具有因果影响力的上下文片段。通过将检测聚焦于相对较短的文本，AgentWatcher可扩展至长上下文场景。针对第二个局限性，我们定义了一套明确区分提示注入与非注入行为的规则，并利用监控LLM基于归因文本进行规则推理，使检测决策更具可解释性。我们在工具使用智能体基准测试和长上下文理解数据集上进行了全面评估。实验结果表明，AgentWatcher能有效检测提示注入攻击，并在无攻击场景下保持功能效用。代码已开源：https://github.com/wang-yanting/AgentWatcher。

English

Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.

AgentWatcher：基于规则的提示注入监控系统

AgentWatcher: A Rule-based Prompt Injection Monitor

摘要

Support