ASGuard: 표적 재킹 공격 완화를 위한 활성화 스케일링 가드

초록

대규모 언어 모델(LLM)은 안전성 정렬이 이루어졌음에도 불구하고, 단순한 언어적 변화로 우회될 수 있는 취약한 거부 행동을 보입니다. 시제 탈옥(Tense Jailbreaking)은 유해한 요청을 거부하는 모델이 과거 시제로 재구성될 경우 종종 요청을 수용한다는 점을 보여주며, 현재의 정렬 방법에 내재된 메커니즘에 대한 이해가 부족한 치명적인 일반화 격차를 드러냅니다. 본 연구에서는 이러한 특정 취약점을 정교하게 완화하는 통찰력 있고 메커니즘 기반의 프레임워크인 Activation-Scaling Guard(ASGuard)를 소개합니다. 첫 번째 단계에서는 회로 분석을 사용하여 시제 변경 공격과 같은 표적 탈옥과 인과적으로 연결된 특정 어텐션 헤드를 식별합니다. 두 번째 단계에서는 시제 취약 헤드의 활성화를 재조정하기 위해 정밀한 채널별 스케일링 벡터를 학습합니다. 마지막으로 이를 '예방적 미세 조정'에 적용하여 모델이 더욱 강력한 거부 메커니즘을 학습하도록 유도합니다. 4가지 LLM에 대한 실험에서 ASGuard는 일반 능력을 보존하고 과도한 거부를 최소화하면서 표적 탈옥의 공격 성공률을 효과적으로 감소시켜 안전성과 유용성 간의 파레토 최적 균형을 달성했습니다. 우리의 메커니즘 분석에 따르면, 적대적 접미사가 거부를 매개하는 방향의 전파를 억제하는 방식이 확인되었습니다. 더 나아가, 본 연구는 모델 내부에 대한 깊은 이해가 모델 행동을 조정하기 위한 실용적이고 효율적이며 표적화된 방법을 개발하는 데 어떻게 활용될 수 있는지 보여주며, 더욱 신뢰할 수 있고 해석 가능한 AI 안전을 위한 방향을 제시합니다.

English

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. In the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a "preventative fine-tuning", forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.

ASGuard: 표적 재킹 공격 완화를 위한 활성화 스케일링 가드

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

초록

Support