Send SCOUT First: Prior Inference for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is reliable on all of them. Yet existing systems still treat detection as a fixed single-detector pipeline, exposing every request to that one detector's blind spots. We reframe defense as a detector-allocation problem: given a heterogeneous pool of detectors, decide for each request which detectors to run and whether to escalate to an LLM judge. Our framework, SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage), makes this decision dynamically by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and it exposes a single safety-utility threshold to the operator, where utility combines benign-pass rate and wall-clock time. To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that existing prompt-injection datasets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at the cost of a 5.1-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), where it improves the safety-utility frontier.
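
To make the allocation idea concrete, the following is a minimal sketch rather than the SCOUT implementation. It assumes that each request is represented by a feature vector, that per-detector reliability and latency are estimated by averaging over the k most similar past requests, and that the request is escalated to the LLM judge whenever the best predicted reliability falls below the operator's threshold. The names `PastRecord`, `predict`, and `allocate`, the cosine-similarity estimator, and the choice of a single detector per request are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class PastRecord:
    """One logged request (hypothetical schema, not from the paper)."""
    embedding: tuple          # assumed feature vector for the request
    correct: dict             # detector name -> was its verdict correct?
    latency_ms: dict          # detector name -> observed latency


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict(query, history, detectors, k=3):
    """Estimate (reliability, latency) per detector from the k nearest past inputs."""
    neighbours = sorted(
        history, key=lambda r: cosine(query, r.embedding), reverse=True
    )[:k]
    preds = {}
    for d in detectors:
        if not neighbours:
            # No history: assume zero reliability so the request is escalated.
            preds[d] = (0.0, 0.0)
            continue
        rel = sum(r.correct[d] for r in neighbours) / len(neighbours)
        lat = sum(r.latency_ms[d] for r in neighbours) / len(neighbours)
        preds[d] = (rel, lat)
    return preds


def allocate(query, history, detectors, threshold, k=3):
    """Pick the detector with the highest predicted reliability
    (ties broken by lower predicted latency) and decide on escalation."""
    preds = predict(query, history, detectors, k)
    best = max(detectors, key=lambda d: (preds[d][0], -preds[d][1]))
    escalate = preds[best][0] < threshold  # assumed escalation rule
    return best, escalate


history = [
    PastRecord((1.0, 0.0), {"a": True, "b": False}, {"a": 10.0, "b": 5.0}),
    PastRecord((0.9, 0.1), {"a": True, "b": False}, {"a": 12.0, "b": 5.0}),
    PastRecord((0.0, 1.0), {"a": False, "b": True}, {"a": 10.0, "b": 6.0}),
    PastRecord((0.1, 0.9), {"a": False, "b": True}, {"a": 11.0, "b": 6.0}),
]
choice, needs_judge = allocate((1.0, 0.0), history, ["a", "b"], threshold=0.8, k=2)
```

Under these assumptions, raising `threshold` sends more requests to the judge (safer but slower), while lowering it keeps more requests on the cheap detectors, which is one way a single operator-facing knob could trade safety against utility.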