The Rogue Scalpel: Activation Steering Compromises LLM Safety
September 26, 2025
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
cs.AI
Abstract
Activation steering is a promising technique for controlling LLM behavior by
adding semantically meaningful vectors directly into a model's hidden states
during inference. It is often framed as a precise, interpretable, and
potentially safer alternative to fine-tuning. We demonstrate the opposite:
steering systematically breaks model alignment safeguards, making the model
comply with harmful requests. Through extensive experiments on different model
families, we show that even steering in a random direction can increase the
probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign
features from a sparse autoencoder (SAE), a common source of interpretable
directions, increases these rates by a further 2-4%. Finally, we show that
combining 20 randomly sampled vectors that jailbreak a single prompt creates a
universal attack, significantly increasing harmful compliance on unseen
requests. These results challenge the paradigm of safety through
interpretability, showing that precise control over model internals does not
guarantee precise control over model behavior.
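To make the mechanism in the abstract concrete, here is a minimal, hypothetical sketch of activation steering implemented as a PyTorch forward hook on a Hugging Face transformers model: a random unit vector (as in the paper's random-direction experiment) is added to one layer's hidden states on every forward pass. The model name, layer index, and steering scale are illustrative assumptions, not the authors' actual setup, which covers multiple model families and SAE-derived directions.

```python
# Minimal sketch of activation steering via a forward hook.
# Assumptions: GPT-2 as a stand-in model; layer index and scale are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative model, not the one studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                   # which block's output to modify (assumption)
hidden_dim = model.config.hidden_size
steering_vec = torch.randn(hidden_dim)          # a random direction, as in the random-steering experiment
steering_vec = steering_vec / steering_vec.norm()
scale = 8.0                                     # steering strength (assumption)

def add_steering(module, inputs, output):
    # A GPT-2 block returns a tuple whose first element is the hidden states;
    # add the scaled steering vector to every token position.
    hidden = output[0]
    vec = steering_vec.to(dtype=hidden.dtype, device=hidden.device)
    return (hidden + scale * vec,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "Explain how to stay safe online."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later generations run unmodified
```

The point of the sketch is only to show how little machinery steering requires: a single additive intervention on the residual stream, applied at inference time, is the kind of edit the paper finds can already erode safety behavior.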