SCOUT Ahead: Anticipatory Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility combines benign-pass rate and wall-clock time). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at the cost of a 5.1-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier on each.
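
To make the allocation idea concrete, the following is a minimal sketch under stated assumptions, not the SCOUT implementation. It assumes each detector keeps a history of past runs with a feature vector, a correctness flag, and a latency; it estimates per-sample reliability and latency by averaging over the k most similar past runs (a nearest-neighbour stand-in for whatever predictor the framework actually uses), and it applies one threshold to choose between a cheap detector and escalation. All names here (`PastRun`, `predict`, `allocate`, the detector labels) are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class PastRun:
    embedding: list[float]  # hypothetical feature vector of a past input
    correct: bool           # whether the detector's verdict was right
    latency_s: float        # observed wall-clock seconds for that run


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict(history: list[PastRun], query: list[float], k: int = 3) -> tuple[float, float]:
    """Estimate (reliability, latency) as means over the k most similar past runs."""
    nearest = sorted(history, key=lambda r: cosine(r.embedding, query), reverse=True)[:k]
    if not nearest:
        return 0.0, 0.0
    reliability = sum(r.correct for r in nearest) / len(nearest)
    latency = sum(r.latency_s for r in nearest) / len(nearest)
    return reliability, latency


def allocate(
    histories: dict[str, list[PastRun]],
    query: list[float],
    threshold: float,
    k: int = 3,
) -> tuple[list[str], bool]:
    """Return (detectors to run, whether to escalate to an LLM judge).

    Assumed policy: run the fastest detector whose predicted reliability
    clears the threshold; if none does, run the most reliable one and escalate.
    """
    preds = {name: predict(h, query, k) for name, h in histories.items()}
    if not preds:
        return [], True
    cleared = [(lat, name) for name, (rel, lat) in preds.items() if rel >= threshold]
    if cleared:
        return [min(cleared)[1]], False
    best = max(preds, key=lambda name: preds[name][0])
    return [best], True


# Toy histories: "regex" is fast but unreliable on inputs near [0, 1].
histories = {
    "regex": [PastRun([1.0, 0.0], True, 0.01), PastRun([1.0, 0.0], True, 0.01),
              PastRun([0.0, 1.0], False, 0.01), PastRun([0.0, 1.0], False, 0.01)],
    "classifier": [PastRun([1.0, 0.0], True, 0.2), PastRun([1.0, 0.0], True, 0.2),
                   PastRun([0.0, 1.0], True, 0.2), PastRun([0.0, 1.0], True, 0.2)],
}
easy = allocate(histories, [1.0, 0.0], threshold=0.9, k=2)
hard = allocate(histories, [0.0, 1.0], threshold=0.9, k=2)
```

In this toy setup the first query is routed to the fast detector and the second to the slower one that has been reliable on similar inputs. Raising `threshold` makes escalation more frequent, which is the kind of single-knob trade-off between safety and utility that the abstract describes; the specific policy above is an assumption made for illustration.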