Linear representations in language models can change dramatically over a conversation
January 28, 2026
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
cs.AI
Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
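To make the probing and steering setup concrete, below is a minimal sketch (not the authors' code) of projecting a model's hidden state onto a linear direction at different points in a conversation, and of steering along that direction with a forward hook. The model name, layer index, example prompts, and the randomly initialized direction vector are all placeholder assumptions; in practice the direction would be fit beforehand, e.g. with a linear probe on labeled statements.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"   # placeholder; the paper studies multiple model families
    LAYER = 6             # hypothetical probe layer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
    model.eval()

    # Hypothetical unit vector for a "factual vs. non-factual" direction;
    # here it is random purely for illustration.
    hidden_size = model.config.hidden_size
    direction = torch.randn(hidden_size)
    direction = direction / direction.norm()

    def projection_at_end(prompt: str) -> float:
        # Project the final-token hidden state at LAYER onto the direction.
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        hidden = outputs.hidden_states[LAYER][0, -1]   # shape: (hidden_size,)
        return float(hidden @ direction)

    # Probe the same statement with and without role-setting conversational context.
    statement = "The capital of France is Paris."
    early = projection_at_end(statement)
    late = projection_at_end(
        "User: Let's co-write a story set in a world where geography is scrambled.\n"
        "Assistant: Sure, happy to!\n" + statement
    )
    print(f"projection with no prior context:    {early:.3f}")
    print(f"projection after role-setting turns: {late:.3f}")

    # Steering sketch: add a scaled copy of the direction to the layer's output
    # via a forward hook. model.transformer.h is the GPT-2 block list; other
    # architectures name their block stack differently.
    def steering_hook(module, inputs, output, alpha=5.0):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
    # ... call model.generate(...) here to sample steered continuations ...
    handle.remove()

Under the paper's framing, the point of such a setup is that both the projection for a fixed statement and the effect of the same steering intervention can differ substantially depending on the conversational context that precedes it.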