Steered LLM Activations Are Non-Surjective

Abstract

Activation steering is a popular white-box control technique that modifies a model's activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and in safety research (e.g., assessing jailbreakability). However, it is unclear whether steered behavior can be realized by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
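
The surjectivity question can be made concrete with a toy sketch. The code below is not the paper's proof or its experiments; it is a minimal illustration under assumed names and sizes (`toy_forward`, `EMBED`, `steer`, `VOCAB`, `DIM`, and `MAX_LEN` are all hypothetical). A tiny stand-in "model" maps every prompt of bounded length to a state vector, giving a finite set of prompt-reachable states, and a steered state is then compared against that set.

```python
import itertools
import math
import random

random.seed(0)

# Hypothetical toy sizes; a real LLM has a far larger vocabulary and width.
VOCAB, DIM, MAX_LEN = 4, 3, 3

# Random token embeddings standing in for a real embedding table.
EMBED = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]


def toy_forward(prompt):
    """Map a token sequence to a toy 'residual stream' vector."""
    state = [0.0] * DIM
    for pos, tok in enumerate(prompt):
        # Position-dependent update followed by a smooth nonlinearity.
        state = [math.tanh(s + EMBED[tok][i] / (pos + 1))
                 for i, s in enumerate(state)]
    return state


def all_prompts():
    """Enumerate every prompt of length 1..MAX_LEN over the toy vocabulary."""
    for n in range(1, MAX_LEN + 1):
        yield from itertools.product(range(VOCAB), repeat=n)


# The image of the forward pass: a finite set of points in R^DIM.
reachable = [toy_forward(p) for p in all_prompts()]


def steer(state, direction, alpha):
    """Additive steering: shift the state along a fixed direction."""
    return [s + alpha * d for s, d in zip(state, direction)]


def dist_to_reachable(state):
    """Euclidean distance from a state to the nearest prompt-reachable state."""
    return min(math.dist(state, r) for r in reachable)


direction = [random.gauss(0, 1) for _ in range(DIM)]
steered = steer(toy_forward((0, 1)), direction, alpha=0.5)
gap = dist_to_reachable(steered)

print(f"{len(reachable)} reachable states")
print(f"distance from steered state to nearest reachable state: {gap:.4f}")
```

In this toy setting the reachable set is finite, so a steered state drawn along a random direction coincides with a reachable state only with probability zero. The nonzero gap is the analogue of a steered activation having no preimage under the forward pass; the formal statement for real models rests on the paper's own assumptions, which this sketch does not reproduce.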