Steered LLM Activations Are Non-Surjective

Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
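
The surjectivity question above can be written down in a few lines. The following is a minimal sketch, not the paper's own formalization: the symbols $\mathcal{V}$, $f_\ell$, $v$, and $\alpha$ are illustrative assumptions introduced here, and the countability argument is only one simple way to see why an "almost surely" statement is plausible. The paper's actual assumptions and proof may differ.

```latex
% Notation is illustrative and not taken from the paper.
% \mathcal{V}: vocabulary; \mathcal{V}^{*}: set of all finite prompts;
% f_\ell: map from a prompt to the layer-\ell residual-stream state (last token);
% v: steering direction; \alpha: steering strength; x: a fixed prompt.
\[
  f_\ell : \mathcal{V}^{*} \to \mathbb{R}^{d}, \qquad
  \tilde{h} = f_\ell(x) + \alpha v .
\]
% Surjectivity question: is the steered state reachable by some prompt?
\[
  \exists\, x' \in \mathcal{V}^{*} \ \text{such that}\ f_\ell(x') = \tilde{h} \ ?
\]
% \mathcal{V}^{*} is countable, so its image f_\ell(\mathcal{V}^{*}) is a countable
% subset of \mathbb{R}^{d} and has Lebesgue measure zero. Assuming (hypothetically)
% that \alpha v is drawn from a distribution with a density on \mathbb{R}^{d}:
\[
  \Pr\!\left[\, \tilde{h} \in f_\ell(\mathcal{V}^{*}) \,\right] = 0 .
\]
```

Under these assumptions, the steered state $\tilde{h}$ has no prompt preimage with probability one, which is the kind of separation between white-box steering and black-box prompting that the abstract describes.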