Linear representations in language models can change dramatically over a conversation
January 28, 2026
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
cs.AI
Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
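The abstract describes two operations that are easy to picture concretely: reading off the projection of hidden states onto a linear direction as a conversation unfolds, and steering by adding a scaled copy of that direction during generation. The sketch below illustrates both under stated assumptions; it is not the paper's method. The model (`gpt2`), the layer index, and the randomly initialized `direction` vector are placeholders standing in for a probe-derived direction, and the forward-hook steering is one common way such interventions are implemented.

```python
# Minimal illustrative sketch (not the paper's code): track the projection of
# the last-token hidden state onto a hypothetical linear direction across
# conversation turns, then steer by adding alpha * direction via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper studies other model families
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

layer = 6  # arbitrary residual-stream layer to read from
hidden_size = model.config.hidden_size
direction = torch.randn(hidden_size)
direction = direction / direction.norm()  # placeholder unit vector, not a trained probe

conversation = [
    "User: Tell me a fact about the Moon.",
    "Assistant: The Moon orbits the Earth roughly every 27 days.",
    "User: Now pretend we live in a world where the Moon is made of cheese.",
]

def projection_at_turn(turns):
    """Project the last-token hidden state at `layer` onto the direction."""
    ids = tok("\n".join(turns), return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[layer][0, -1]  # last token, chosen layer
    return float(h @ direction)

# How does the projection drift as more of the conversation is in context?
for t in range(1, len(conversation) + 1):
    print(f"turns={t}  projection={projection_at_turn(conversation[:t]):.3f}")

# Steering: add alpha * direction to the chosen block's output during generation.
alpha = 5.0

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(steer_hook)
prompt = tok("\n".join(conversation) + "\nAssistant:", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=30)[0]))
handle.remove()
```

Running the projection loop at different points in a long conversation is the kind of measurement that would reveal the drift the abstract describes; applying the same steering hook early versus late in the conversation is the kind of comparison behind the claim that steering can have very different effects at different points.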