SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
February 6, 2026
Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
cs.AI
Abstract
Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break down under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward then trains the attacker to generate valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (supervised fine-tuning) and DPO (direct preference optimization) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, surpassing the previous SOTA by 33.9%. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
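The abstract describes the intent-drift-aware reward as combining three judge signals: intent alignment, compliance risk, and level of detail. A minimal sketch of one plausible aggregation (a weighted sum; the function name, weights, and the linear form are illustrative assumptions, as the abstract does not specify how the terms are combined):

```python
def intent_drift_aware_reward(intent_alignment: float,
                              compliance_risk: float,
                              detail_level: float,
                              weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Hypothetical aggregation of the three scores named in the abstract.

    Each input is assumed to be a normalized judge score in [0, 1]:
    - intent_alignment: how closely the rollout tracks the original harmful goal
      (penalizing intent drift across turns),
    - compliance_risk: how likely the victim model is to comply rather than refuse,
    - detail_level: how specific and actionable the elicited content is.
    The weighted sum below is an assumption, not the paper's formula.
    """
    w_a, w_c, w_d = weights
    return w_a * intent_alignment + w_c * compliance_risk + w_d * detail_level
```

Such a scalar reward would score each attacker rollout during the RL stage; because the regime is open-loop, the score depends only on the generated multi-turn prompts and a judge, not on live victim feedback.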