One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention: the monitor must identify the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID enables a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
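To make the detection objective concrete, the following is a minimal sketch of a response-aware turn-level gate. It is not the TurnGate implementation: the function `first_harm_enabling_turn`, the `GateDecision` type, the threshold, and the keyword-based `toy_score` are all hypothetical stand-ins introduced here for illustration. The sketch shows only the control flow implied by the objective: the monitor scores the accumulated history together with the candidate response before that response is delivered, and it intervenes at the first turn whose delivery would cross the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (user_message, delivered_assistant_response)
ScoreFn = Callable[[List[Turn], str, str], float]


@dataclass
class GateDecision:
    turn_index: Optional[int]  # 0-based index of the first withheld turn, or None
    delivered: List[Turn]      # turns delivered before any intervention


def first_harm_enabling_turn(
    user_messages: List[str],
    candidate_responses: List[str],
    score_fn: ScoreFn,
    threshold: float,
) -> GateDecision:
    """Return the earliest turn whose candidate response would cross the threshold."""
    history: List[Turn] = []
    for i, (message, candidate) in enumerate(zip(user_messages, candidate_responses)):
        # Response-aware: the scorer sees the candidate response before delivery.
        risk = score_fn(history, message, candidate)
        if risk >= threshold:
            return GateDecision(turn_index=i, delivered=history)
        history.append((message, candidate))
    return GateDecision(turn_index=None, delivered=history)


def toy_score(history: List[Turn], message: str, candidate: str) -> float:
    """Hypothetical scorer: fraction of placeholder 'parts' covered so far."""
    required = ("part-a", "part-b", "part-c")
    text = " ".join(response for _, response in history) + " " + candidate
    return sum(part in text for part in required) / len(required)


# A dialogue whose responses jointly cover every placeholder part.
attack = first_harm_enabling_turn(
    ["q1", "q2", "q3"],
    ["covers part-a", "covers part-b", "covers part-c"],
    toy_score,
    threshold=1.0,
)

# A matched benign dialogue that never completes the set.
benign = first_harm_enabling_turn(
    ["q1", "q2", "q3"],
    ["covers part-a", "general info", "more general info"],
    toy_score,
    threshold=1.0,
)

print(attack.turn_index, benign.turn_index)
```

In this toy setting the first two turns of the attack dialogue are delivered and only the closing turn is withheld, while the benign dialogue is never interrupted. A learned monitor would replace `toy_score`; the keyword check is used here only to keep the example self-contained.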
