WAInjectBench: Benchmark per il Rilevamento di Iniezioni di Prompt per Agenti Web

Abstract

Sono stati proposti diversi attacchi di prompt injection contro gli agenti web. Allo stesso tempo, sono stati sviluppati vari metodi per rilevare gli attacchi di prompt injection in generale, ma nessuno è stato valutato sistematicamente per gli agenti web. In questo lavoro, colmiamo questa lacuna presentando il primo studio di benchmark completo sul rilevamento degli attacchi di prompt injection mirati agli agenti web. Iniziamo introducendo una categorizzazione dettagliata di tali attacchi basata sul modello di minaccia. Successivamente, costruiamo dataset contenenti sia campioni malevoli che benigni: segmenti di testo malevoli generati da diversi attacchi, segmenti di testo benigni di quattro categorie, immagini malevole prodotte da attacchi e immagini benignhe di due categorie. Poi, sistematizziamo sia i metodi di rilevamento basati su testo che quelli basati su immagini. Infine, ne valutiamo le prestazioni in diversi scenari. I nostri risultati principali mostrano che, sebbene alcuni rilevatori possano identificare attacchi che si basano su istruzioni testuali esplicite o perturbazioni visibili nelle immagini con una precisione da moderata a elevata, falliscono in gran parte contro attacchi che omettono istruzioni espliciti o utilizzano perturbazioni impercettibili. I nostri dataset e il codice sono rilasciati all'indirizzo: https://github.com/Norrrrrrr-lyn/WAInjectBench.

English

Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.

WAInjectBench: Benchmark per il Rilevamento di Iniezioni di Prompt per Agenti Web

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Abstract

Support