One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that, despite advances in safety alignment and external guardrails, even modern commercial models remain vulnerable to such attacks. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
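
The detection objective described in the abstract can be illustrated with a minimal sketch. This is not TurnGate's implementation: the function `earliest_harm_enabling_turn`, the scorer interface, the threshold, and the placeholder scorer `toy_score` are all assumptions introduced here for illustration. The sketch only shows the response-aware control flow, in which the monitor scores the accumulated history together with the candidate response before that response is delivered, and reports the first turn at which the score crosses the threshold.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# One turn = (user message, candidate assistant response).
Turn = Tuple[str, str]
# Hypothetical scorer: (delivered history, user message, candidate response) -> score in [0, 1].
ScoreFn = Callable[[List[Turn], str, str], float]


def earliest_harm_enabling_turn(
    turns: Sequence[Turn], score_fn: ScoreFn, threshold: float
) -> Optional[int]:
    """Return the index of the first turn whose candidate response would push
    the accumulated interaction over the threshold, or None if no turn does."""
    history: List[Turn] = []
    for index, (user_msg, candidate) in enumerate(turns):
        # Response-aware check: the candidate is scored BEFORE it is delivered.
        if score_fn(history, user_msg, candidate) >= threshold:
            return index  # intervene here; earlier turns were allowed through
        history.append((user_msg, candidate))
    return None


# Placeholder scorer (an assumption, not a real detector): the fraction of
# abstract "pieces" that the delivered responses plus the candidate would cover.
PIECES = ("piece_a", "piece_b", "piece_c")


def toy_score(history: List[Turn], user_msg: str, candidate: str) -> float:
    delivered = " ".join(resp for _, resp in history) + " " + candidate
    return sum(1 for piece in PIECES if piece in delivered) / len(PIECES)


attack = [("q1", "piece_a"), ("q2", "piece_b"), ("q3", "piece_c"), ("q4", "other")]
benign = [("q1", "piece_a"), ("q2", "unrelated")]

attack_turn = earliest_harm_enabling_turn(attack, toy_score, threshold=1.0)
benign_turn = earliest_harm_enabling_turn(benign, toy_score, threshold=1.0)
```

In this toy setting the attack conversation is flagged only at the turn that completes the set of pieces, while the benign conversation, which overlaps with it partially, is never refused. That mirrors the stated goal of intervening at the closure point without prematurely refusing exploratory dialogue.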
