WAInjectBench：面向网页代理的提示注入检测基准测试

摘要

针对网络代理，已提出了多种提示注入攻击。与此同时，尽管已开发出多种方法来检测一般的提示注入攻击，但尚未有系统性地针对网络代理进行评估的研究。本工作填补了这一空白，首次提出了针对网络代理的提示注入攻击检测的全面基准研究。我们首先基于威胁模型，对此类攻击进行了细致的分类。随后，构建了包含恶意与良性样本的数据集：恶意文本片段由不同攻击生成，良性文本片段涵盖四种类别；恶意图像由攻击产生，良性图像则来自两个类别。接着，我们系统化地整理了基于文本和图像的检测方法。最后，在多种场景下评估了它们的性能。我们的核心发现表明，虽然部分检测器能够以中等至高准确率识别依赖显式文本指令或可见图像扰动的攻击，但对于省略显式指令或采用不可察觉扰动的攻击，这些检测器大多失效。我们的数据集与代码已发布于：https://github.com/Norrrrrrr-lyn/WAInjectBench。

English

Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.

WAInjectBench：面向网页代理的提示注入检测基准测试

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

摘要

Support