SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Abstract

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots; single-turn attacks are merely a special case. Yet existing approaches break down under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. First, prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Second, reinforcement learning with an intent-drift-aware reward trains the attacker to generate valid multi-turn adversarial prompts while maintaining the same harmful objective. The intent-drift-aware reward anchors harmful intent in multi-turn jailbreaks by combining intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, and our own SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, 33.9% above the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
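The abstract reports results as ASR@1 without spelling out the metric. As a rough illustration only, the sketch below assumes the common reading of ASR@k: a harmful request counts as jailbroken if a judge marks at least one of the first k independent attack attempts as successful, and ASR@k is the fraction of requests jailbroken. The function name `asr_at_k` and the toy verdicts are hypothetical and are not taken from the SEMA code base; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def asr_at_k(judgments: Sequence[Sequence[bool]], k: int = 1) -> float:
    """Fraction of requests with at least one success in the first k attempts.

    judgments[i][j] is the judge's verdict (True = jailbroken) for the
    j-th attack attempt on the i-th harmful request. This definition is
    an assumption, not the paper's stated protocol.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not judgments:
        raise ValueError("need at least one request")
    # A request is a hit if any of its first k attempts succeeded.
    hits = sum(1 for attempts in judgments if any(attempts[:k]))
    return hits / len(judgments)


# Hypothetical judge verdicts: one row per harmful request,
# one column per attack attempt.
verdicts = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]

asr1 = asr_at_k(verdicts, k=1)  # only the first attempt counts
asr2 = asr_at_k(verdicts, k=2)  # either of the first two attempts counts
```

Under this reading, ASR@1 is the strictest setting because the attacker gets a single attempt per request, so a method cannot raise its score simply by retrying.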
