One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention: the monitor must identify the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID supports the development of a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
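
The objective stated in the abstract can be read as a per-turn gating loop: before each candidate response is delivered, a monitor judges whether the accumulated interaction plus that response would be sufficient to enable harm. The Python below is a minimal sketch under stated assumptions, not TurnGate's actual implementation. The names `first_harm_enabling_turn`, `GateDecision`, and `toy_scorer`, as well as the threshold value, are hypothetical, and the toy scorer is only a stand-in for a trained monitor.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# One turn: (user_message, candidate_assistant_response).
Turn = Tuple[str, str]

# Hypothetical scorer interface: maps (delivered history, candidate turn)
# to a risk value in [0, 1]. A real monitor would be a trained model.
RiskScorer = Callable[[List[Turn], Turn], float]


@dataclass
class GateDecision:
    turn_index: int  # 0-based index of the turn being judged
    risk: float      # score returned by the scorer
    blocked: bool    # True if the candidate response should be withheld


def first_harm_enabling_turn(
    dialogue: List[Turn],
    scorer: RiskScorer,
    threshold: float = 0.5,  # assumed value, for illustration only
) -> Optional[GateDecision]:
    """Return the decision for the earliest turn whose candidate response
    pushes the accumulated interaction to or above the threshold, else None."""
    history: List[Turn] = []
    for i, candidate in enumerate(dialogue):
        risk = scorer(history, candidate)
        if risk >= threshold:
            return GateDecision(turn_index=i, risk=risk, blocked=True)
        # The response is delivered, so it becomes part of the context
        # against which later turns are judged.
        history.append(candidate)
    return None


def toy_scorer(history: List[Turn], candidate: Turn) -> float:
    """Placeholder scorer: fraction of three made-up marker words present in
    the delivered responses plus the candidate response. Not a real detector."""
    markers = ("alpha", "beta", "gamma")
    text = " ".join(resp for _, resp in history + [candidate]).lower()
    return sum(m in text for m in markers) / len(markers)
```

The point of the design is that the scorer sees the candidate response before it is delivered, so the gate can act at the turn that would complete the harmful information rather than one turn later. Because earlier turns score below the threshold, benign exploratory conversations pass through unblocked.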
