Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Abstract

Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates, as evaluated by Llama Guard 3, while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
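
The abstract does not spell out the update rule, so the following is only a minimal sketch of what a rank-one injection could look like, under stated assumptions. It assumes the safety direction is the normalized difference between mean activations on harmful and harmless prompts, and that the weight update is the sign-flipped counterpart of directional ablation, W + alpha * r (r^T W). The function names `safety_direction` and `inject_rank_one`, the scale `alpha`, and the synthetic activations are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def safety_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless).

    Assumption: both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def inject_rank_one(W_out, direction, alpha=1.0):
    """Return W_out + alpha * r (r^T W_out).

    Assumption: W_out has shape (d_model, d_in) and writes into the
    residual stream. The added term is an outer product, so it has rank one.
    """
    r = direction.reshape(-1, 1)           # column vector, shape (d_model, 1)
    return W_out + alpha * r @ (r.T @ W_out)

# Synthetic stand-ins for real residual-stream activations (illustration only).
rng = np.random.default_rng(0)
d_model, d_in, n = 8, 16, 32
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + 2.0 * np.eye(d_model)[0]

r_hat = safety_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))       # stand-in for one write matrix
W_new = inject_rank_one(W, r_hat, alpha=0.5)
```

Under this assumed form, every output of the modified matrix has its component along `r_hat` scaled by `1 + alpha`, while components orthogonal to `r_hat` are unchanged. Applying the same update to each write matrix would mirror the abstract's "all residual stream write matrices", though the exact scaling used by ROSI may differ.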
