WAInjectBench：面向网络代理的提示注入检测基准测试

摘要

針對網絡代理的多種提示注入攻擊已被提出。與此同時，多種檢測一般提示注入攻擊的方法也相繼開發，但尚未有系統性地針對網絡代理進行評估。在本研究中，我們填補了這一空白，首次對針對網絡代理的提示注入攻擊檢測進行了全面的基準研究。我們首先基於威脅模型，對此類攻擊進行了細緻的分類。隨後，我們構建了包含惡意與良性樣本的數據集：惡意文本片段由不同攻擊生成，良性文本片段來自四類，惡意圖像由攻擊產生，而良性圖像則來自兩類。接著，我們系統化地整理了基於文本和圖像的檢測方法。最後，我們在多種情境下評估了它們的性能。我們的主要發現表明，雖然部分檢測器能夠以中等至高準確度識別依賴於顯式文本指令或可見圖像擾動的攻擊，但對於那些省略顯式指令或採用不可察覺擾動的攻擊，這些檢測器大多失效。我們的數據集和代碼已發佈於：https://github.com/Norrrrrrr-lyn/WAInjectBench。

English

Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.

WAInjectBench：面向网络代理的提示注入检测基准测试

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

摘要

Support