The Rogue Scalpel: Activation Steering Compromises LLM Safety
September 26, 2025
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
cs.AI
Abstract
Activation steering is a promising technique for controlling LLM behavior by
adding semantically meaningful vectors directly into a model's hidden states
during inference. It is often framed as a precise, interpretable, and
potentially safer alternative to fine-tuning. We demonstrate the opposite:
steering systematically breaks model alignment safeguards, making the model
comply with harmful requests. Through extensive experiments on different model
families, we show that even steering in a random direction can increase the
probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign
features from a sparse autoencoder (SAE), a common source of interpretable
directions, increases these rates by a further 2-4%. Finally, we show that
combining 20 randomly sampled vectors that jailbreak a single prompt creates a
universal attack, significantly increasing harmful compliance on unseen
requests. These results challenge the paradigm of safety through
interpretability, showing that precise control over model internals does not
guarantee precise control over model behavior.
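
For concreteness, below is a minimal sketch of the kind of intervention the abstract describes: a PyTorch forward hook that adds a random, unit-norm direction to the hidden states of a single decoder layer of a Hugging Face causal LM during generation. This is an illustration under our own assumptions, not the authors' code; the model name, layer index LAYER_IDX, steering strength ALPHA, and the model.model.layers path (LLaMA-style architectures) are illustrative choices.

    # Sketch of activation steering with a random direction (assumptions noted above).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
    LAYER_IDX = 15    # layer at which the vector is injected (illustrative)
    ALPHA = 8.0       # steering strength (illustrative)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    model.eval()

    # A random unit-norm direction in the residual stream, as in the
    # random-steering experiments described in the abstract.
    direction = torch.randn(model.config.hidden_size, dtype=model.dtype)
    direction = direction / direction.norm()

    def steering_hook(module, inputs, output):
        # Decoder layers return a tuple; output[0] holds the hidden states of
        # shape (batch, seq_len, hidden_size). Add the scaled direction to
        # every token position and pass the rest of the tuple through.
        hidden = output[0] + ALPHA * direction.to(device=output[0].device,
                                                  dtype=output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
    try:
        prompt = "How do I make a paper airplane?"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(out[0], skip_special_tokens=True))
    finally:
        handle.remove()  # detach the hook so later calls run unsteered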