One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
May 12, 2026
Authors: Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li
cs.AI
Abstract
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps train a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
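The response-aware, turn-level objective described in the abstract can be illustrated with a minimal Python sketch. This is not the TurnGate implementation or its API: the names `first_closure_turn`, `GateDecision`, `toy_risk_score`, the `PIECES` set, and the threshold are all hypothetical, and the toy scorer is a placeholder for a learned monitor. The sketch only shows the control flow: before each candidate response is delivered, the monitor scores the accumulated interaction as it would look with that response included, and the first turn that crosses the threshold is treated as the closure point.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# One turn = (user message, candidate assistant response). Hypothetical type.
Turn = Tuple[str, str]
# Hypothetical scorer signature: (delivered history, user message, candidate) -> risk.
RiskScorer = Callable[[List[Turn], str, str], float]


@dataclass
class GateDecision:
    closure_turn: Optional[int]  # 0-based index of the first withheld turn, or None
    delivered: List[Turn]        # turns whose responses were actually delivered


def first_closure_turn(dialogue: List[Turn], risk_score: RiskScorer,
                       threshold: float) -> GateDecision:
    """Return the earliest turn whose candidate response would push risk past threshold."""
    delivered: List[Turn] = []
    for i, (user_msg, candidate) in enumerate(dialogue):
        # Response-aware check: score the history plus the not-yet-delivered candidate.
        if risk_score(delivered, user_msg, candidate) >= threshold:
            return GateDecision(closure_turn=i, delivered=delivered)
        delivered.append((user_msg, candidate))
    return GateDecision(closure_turn=None, delivered=delivered)


# Toy stand-in for a learned monitor (an assumption, for illustration only):
# risk is the fraction of placeholder "pieces" that the responses would cover.
PIECES = ("alpha", "beta", "gamma")


def toy_risk_score(history: List[Turn], user_msg: str, candidate: str) -> float:
    text = " ".join(resp for _, resp in history) + " " + candidate
    return sum(piece in text for piece in PIECES) / len(PIECES)


spread_out = [("q1", "info alpha"), ("q2", "info beta"),
              ("q3", "info gamma"), ("q4", "more detail")]
exploratory = [("q1", "info alpha"), ("q2", "unrelated answer")]

blocked = first_closure_turn(spread_out, toy_risk_score, threshold=1.0)
allowed = first_closure_turn(exploratory, toy_risk_score, threshold=1.0)
```

In this toy run, the first dialogue is stopped at the third turn, where the candidate response would complete the set of pieces, while the exploratory dialogue is never refused. Scoring the candidate response rather than only the user message is what lets the gate intervene before the closing turn is delivered instead of one turn too late.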