Send SCOUT First: Ex-Ante Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, which exposes every request to that one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool of detectors, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework, SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage), makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and it exposes a single safety-utility threshold to the operator (where utility combines benign-pass rate and wall-clock time). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection datasets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at the cost of a 5.1-percentage-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), where it improves the safety-utility frontier.
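
To make the allocation idea concrete, the following is a minimal sketch and not the SCOUT implementation. The abstract does not specify the predictor, the similarity measure, or the decision rule, so everything here is an assumption for illustration: the `PastRecord` type, cosine similarity over a hypothetical embedding, a k-nearest-neighbour average as the reliability and latency estimate, and a rule that runs the fastest detector whose predicted reliability meets a threshold `tau` and otherwise escalates to a judge labelled `"llm_judge"`.

```python
import math
from dataclasses import dataclass


@dataclass
class PastRecord:
    """One previously seen input (hypothetical structure, not from the paper)."""
    embedding: list[float]        # assumed feature vector for the input
    correct: dict[str, bool]      # detector name -> was its verdict correct?
    latency_s: dict[str, float]   # detector name -> observed latency in seconds


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict(history: list[PastRecord], query: list[float], k: int = 3):
    """Estimate (reliability, latency) per detector from the k most similar past inputs."""
    if not history:
        return {}
    neighbours = sorted(
        history, key=lambda r: cosine(r.embedding, query), reverse=True
    )[:k]
    n = len(neighbours)
    return {
        name: (
            sum(r.correct[name] for r in neighbours) / n,    # fraction of correct verdicts
            sum(r.latency_s[name] for r in neighbours) / n,  # mean observed latency
        )
        for name in neighbours[0].correct
    }


def allocate(history: list[PastRecord], query: list[float], tau: float, k: int = 3) -> str:
    """Pick the fastest detector whose predicted reliability is at least tau.

    If no detector qualifies, escalate to the (assumed) LLM judge.
    """
    eligible = [
        (latency, name)
        for name, (reliability, latency) in predict(history, query, k).items()
        if reliability >= tau
    ]
    return min(eligible)[1] if eligible else "llm_judge"
```

In this sketch, raising `tau` sends more requests to the judge (safer but slower), while lowering it lets cheaper detectors handle more traffic. That only loosely mirrors the single safety-utility threshold described above, since the sketch thresholds on predicted reliability alone.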