Steered LLM Activations Are Non-Surjective

Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
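
The surjectivity question can be written down concretely, as in the sketch below. The notation (\mathcal{V}, F_\ell, v, \alpha, \mu) is assumed here for illustration and is not taken from the paper. The sketch also assumes additive steering and a steering direction drawn from a distribution with a density. The countability remark is only an intuition for the "almost surely" claim and should not be read as the paper's proof.

```latex
% Illustrative notation (assumed, not the paper's):
%   \mathcal{V}    finite vocabulary; \mathcal{V}^{*} is the set of finite prompts
%   F_\ell         map from a prompt to the layer-\ell residual-stream state
%   v, \alpha      steering direction and strength (additive steering assumed)
%   \lambda, \mu   Lebesgue measure on R^d; distribution of v, with \mu \ll \lambda
\[
  F_\ell : \mathcal{V}^{*} \to \mathbb{R}^{d}, \qquad
  \tilde{h} = F_\ell(x) + \alpha v, \quad x \in \mathcal{V}^{*},\ \alpha \neq 0 .
\]
% Surjectivity question: does the steered state have a prompt preimage?
\[
  \exists\, x' \in \mathcal{V}^{*} \ \text{such that}\ F_\ell(x') = \tilde{h}\ ?
\]
% Intuition: \mathcal{V}^{*} is countable, so its image under F_\ell is countable
% and therefore Lebesgue-null in R^d; a steered state with a density misses it.
\[
  \lambda\bigl(F_\ell(\mathcal{V}^{*})\bigr) = 0
  \quad\Longrightarrow\quad
  \Pr_{v \sim \mu}\bigl[\tilde{h} \in F_\ell(\mathcal{V}^{*})\bigr] = 0 .
\]
```

Under these assumptions the implication holds for any fixed prompt x and any nonzero \alpha, because the law of \tilde{h} is then absolutely continuous with respect to \lambda.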