The Rogue Scalpel: Activation Steering Compromises LLM Safety

Abstract

Activation steering is a promising technique for controlling the behavior of large language models (LLMs) by adding semantically meaningful vectors directly to a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks a model's alignment safeguards and makes it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can raise the probability of harmful compliance from 0% to 2-27%. More alarmingly, steering along benign features selected from a sparse autoencoder (SAE), a common source of interpretable directions, increases these rates by a further 2-4%. Finally, we show that combining 20 randomly sampled vectors that each jailbreak a single prompt yields a universal attack that significantly increases harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability: precise control over model internals does not guarantee precise control over model behavior.
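
To make the mechanism concrete, here is a minimal sketch of what "adding a vector to the hidden states" means. It uses toy random hidden states in place of a real model, and the function names, the steering strength alpha, the hidden size, and the average-then-renormalize rule for combining 20 directions are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(hidden_states, direction, alpha):
    """Add alpha * direction to the hidden state at every token position."""
    return [[h + alpha * d for h, d in zip(token, direction)]
            for token in hidden_states]


def combine(directions):
    """Average several directions and renormalize.

    Assumption: the abstract only says the vectors are "combined";
    averaging is one plausible choice, not necessarily the paper's.
    """
    dim = len(directions[0])
    mean = [sum(v[i] for v in directions) / len(directions) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


rng = random.Random(0)
dim, num_tokens, alpha = 64, 5, 4.0  # hypothetical sizes and strength

# Toy stand-in for one layer's hidden states (num_tokens x dim).
hidden = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

direction = random_unit_vector(dim, rng)
steered = steer(hidden, direction, alpha)

# Each token moves by exactly alpha along the steering direction.
shifts = [sum((s - h) * d for s, h, d in zip(s_tok, h_tok, direction))
          for s_tok, h_tok in zip(steered, hidden)]

# One combined direction built from 20 random ones.
universal = combine([random_unit_vector(dim, rng) for _ in range(20)])
```

In a real model the same addition would be applied to the output of a chosen layer during the forward pass; which layer and what magnitude to use are further choices this sketch leaves open.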

