AgentWatcher：基于规则的提示注入监控系统

摘要

大型语言模型（LLMs）及其应用（如智能体）极易受到提示注入攻击。当前最先进的提示注入检测方法存在以下局限性：（1）随着上下文长度的增加，其检测效能显著下降；（2）缺乏明确定义提示注入行为的规则，导致检测决策具有隐含性、不透明性且难以追溯。本研究提出AgentWatcher框架以解决上述两大局限。针对第一个局限，AgentWatcher将LLM的输出（如智能体的行动）归因于少量具有因果影响力的上下文片段。通过将检测聚焦于相对简短的文本，该框架可适配长上下文场景。针对第二个局限，我们制定了一套明确界定提示注入行为的规则集，并采用监控LLM基于归因文本进行规则推理，使检测决策更具可解释性。我们在工具调用智能体基准测试和长上下文理解数据集上进行了全面评估。实验结果表明，AgentWatcher能有效检测提示注入攻击，并在无攻击场景下保持模型效能。代码已开源：https://github.com/wang-yanting/AgentWatcher。

English

Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.