
SEMA:针对多轮越狱攻击的简洁高效学习策略

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

February 6, 2026
Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
cs.AI

Abstract

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break down under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, 33.9% above the previous SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
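The abstract states that the reward combines three component scores (intent alignment, compliance risk, and level of detail) to anchor the attacker to a fixed harmful objective. As a minimal sketch of what such a combination could look like, the following assumes a simple weighted sum over component scores in [0, 1]; the function name, weights, and functional form are illustrative assumptions, not taken from the paper:

```python
def intent_drift_aware_reward(intent_alignment: float,
                              compliance_risk: float,
                              level_of_detail: float,
                              weights=(0.5, 0.3, 0.2)) -> float:
    """Hypothetical combined reward for a multi-turn attacker rollout.

    Each component score is assumed to lie in [0, 1]:
      - intent_alignment: how closely the prompts track the original objective
        (penalizing intent drift across turns)
      - compliance_risk: how likely the victim model is to comply
      - level_of_detail: how specific/actionable the elicited response is

    The weights are illustrative; with weights summing to 1, the combined
    reward also lies in [0, 1].
    """
    w_align, w_risk, w_detail = weights
    return (w_align * intent_alignment
            + w_risk * compliance_risk
            + w_detail * level_of_detail)
```

Under this sketch, a rollout that drifts away from the target intent scores low on the first term even if the victim complies, which is the failure mode the intent-drift-aware reward is designed to suppress.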