SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Abstract

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, of which single-turn attacks are merely a special case. Yet existing approaches break down under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated from a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward then trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our supervised fine-tuning (SFT) and direct preference optimization (DPO) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, 33.9% over the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
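
The headline figure is an ASR@1 averaged over victim models. The abstract does not spell out how that number is computed, so the following is only a minimal sketch under two assumptions: that ASR@k counts a prompt as broken when any of the first k attempts is judged successful, and that the average is an unweighted mean over victims. The function names and the toy judge verdicts are hypothetical and are not taken from the SEMA code base or its results.

```python
from typing import Dict, List


def asr_at_k(judgements: List[List[bool]], k: int = 1) -> float:
    """Fraction of prompts with at least one judged success in the first k attempts.

    judgements[i][j] is the judge's verdict for attempt j on prompt i.
    This definition of ASR@k is an assumption, not taken from the paper.
    """
    if not judgements:
        raise ValueError("no judgements given")
    hits = sum(1 for attempts in judgements if any(attempts[:k]))
    return hits / len(judgements)


def mean_asr(per_victim: Dict[str, List[List[bool]]], k: int = 1) -> float:
    """Unweighted mean of ASR@k over victim models (assumed averaging scheme)."""
    scores = [asr_at_k(verdicts, k) for verdicts in per_victim.values()]
    return sum(scores) / len(scores)


# Toy verdicts, made up for illustration only.
toy = {
    "victim_a": [[True], [False], [True], [True]],   # ASR@1 = 0.75
    "victim_b": [[False], [False], [True], [True]],  # ASR@1 = 0.50
}
print(mean_asr(toy, k=1))  # 0.625
```

Under this reading, a different averaging scheme (for example, pooling all prompts across victims before dividing) would give the same value only when every victim is evaluated on the same number of prompts.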
