Send SCOUT First: Pre-Inference for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamically by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock time). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at the cost of a 5.1-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier on each.
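
To make the allocation idea concrete, the following is a minimal sketch and not SCOUT's actual implementation. It assumes that inputs are represented as embedding vectors, that "similar past inputs" means the k nearest neighbours by cosine similarity, and that the operator threshold acts as a cutoff on predicted reliability. The names `PastRecord`, `predict_outcomes`, and `allocate` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PastRecord:
    """One past input and how each detector behaved on it (hypothetical schema)."""
    embedding: list[float]
    correct: dict[str, bool]       # detector name -> was its verdict correct?
    latency_ms: dict[str, float]   # detector name -> observed latency


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict_outcomes(query, history, detectors, k=3):
    """Estimate (reliability, latency) per detector from the k most similar past inputs."""
    if not history:
        raise ValueError("history must contain at least one past record")
    neighbours = sorted(
        history, key=lambda r: cosine(query, r.embedding), reverse=True
    )[:k]
    n = len(neighbours)
    return {
        d: (
            sum(r.correct[d] for r in neighbours) / n,     # predicted reliability
            sum(r.latency_ms[d] for r in neighbours) / n,  # predicted latency
        )
        for d in detectors
    }


def allocate(query, history, detectors, threshold, k=3):
    """Return (detector_to_run, escalate_to_llm_judge).

    Assumed policy: run the fastest detector whose predicted reliability
    meets the threshold; if none does, run the most reliable one and
    escalate to the LLM judge.
    """
    preds = predict_outcomes(query, history, detectors, k)
    eligible = [d for d in detectors if preds[d][0] >= threshold]
    if eligible:
        return min(eligible, key=lambda d: preds[d][1]), False
    return max(detectors, key=lambda d: preds[d][0]), True
```

Under this assumed policy, raising `threshold` sends more requests to the LLM judge (safer but slower), and lowering it favours cheap detectors. This is one way a single knob could trade safety against utility.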