AgentWatcher: Un Monitoraggio per l'Iniezione di Prompt Basato su Regole

Abstract

I grandi modelli linguistici (LLM) e le loro applicazioni, come gli agenti, sono estremamente vulnerabili ad attacchi di prompt injection. I metodi di rilevamento di prompt injection allo stato dell'arte presentano le seguenti limitazioni: (1) la loro efficacia si degrada significativamente all'aumentare della lunghezza del contesto, e (2) mancano di regole esplicite che definiscano cosa costituisce un prompt injection, rendendo le decisioni di rilevamento implicite, opache e difficili da analizzare. In questo lavoro, proponiamo AgentWatcher per affrontare le due limitazioni sopra citate. Per affrontare la prima limitazione, AgentWatcher attribuisce l'output del LLM (ad esempio, l'azione di un agente) a un piccolo insieme di segmenti di contesto causalmente influenti. Concentrando il rilevamento su un testo relativamente breve, AgentWatcher può essere scalabile per contesti lunghi. Per affrontare la seconda limitazione, definiamo un insieme di regole che specificano cosa costituisce e cosa non costituisce un prompt injection, e utilizziamo un LLM monitor per ragionare su queste regole basandosi sul testo attribuito, rendendo le decisioni di rilevamento più spiegabili. Abbiamo condotto una valutazione completa su benchmark di agenti con uso di strumenti e su dataset di comprensione a contesto lungo. I risultati sperimentali dimostrano che AgentWatcher può rilevare efficacemente i prompt injection e mantenere l'utilità in assenza di attacchi. Il codice è disponibile all'indirizzo https://github.com/wang-yanting/AgentWatcher.

English

Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.

AgentWatcher: Un Monitoraggio per l'Iniezione di Prompt Basato su Regole

AgentWatcher: A Rule-based Prompt Injection Monitor

Abstract

Support