Emergence of Linear Truth Encodings in Language Models
October 17, 2025
Authors: Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
cs.AI
Abstract
Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and false statements with other false statements), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then, over a longer horizon, learn to linearly separate true from false, which in turn lowers the language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
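
To make the co-occurrence setting concrete, here is a minimal sketch of a data distribution of the kind the abstract describes: each training sequence concatenates two (subject, attribute) statements that share the same truth value, so an LM can lower its loss on the second statement more easily if it tracks whether the first one was true. The vocabulary sizes, token layout, and statement format below are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the paper's code) of a truth-homogeneous
# co-occurrence distribution: both statements in a sequence are true,
# or both are false.
import numpy as np

rng = np.random.default_rng(0)

N_SUBJECTS = 50      # subject token ids: 0 .. N_SUBJECTS - 1
N_ATTRIBUTES = 50    # attribute token ids: N_SUBJECTS .. N_SUBJECTS + N_ATTRIBUTES - 1

# Ground truth: each subject has exactly one correct attribute.
true_attribute = rng.integers(0, N_ATTRIBUTES, size=N_SUBJECTS)


def sample_statement(is_true: bool) -> tuple[int, int]:
    """Sample one (subject, attribute) statement with the requested truth value."""
    subj = int(rng.integers(0, N_SUBJECTS))
    if is_true:
        attr = int(true_attribute[subj])
    else:
        attr = int(rng.integers(0, N_ATTRIBUTES))
        while attr == true_attribute[subj]:        # resample until incorrect
            attr = int(rng.integers(0, N_ATTRIBUTES))
    return subj, N_SUBJECTS + attr                 # attributes use a disjoint id range


def sample_sequence() -> tuple[list[int], bool]:
    """One training sequence: two statements sharing the same truth value."""
    is_true = bool(rng.integers(0, 2))
    s1, a1 = sample_statement(is_true)
    s2, a2 = sample_statement(is_true)
    # Predicting a2 is easier if the model has inferred whether (s1, a1)
    # was a true pairing -- the pressure toward a truth representation.
    return [s1, a1, s2, a2], is_true


if __name__ == "__main__":
    for _ in range(3):
        tokens, all_true = sample_sequence()
        print(tokens, "all-true" if all_true else "all-false")
```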
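On the probing side, a standard linear-probe recipe in the spirit of the studies the abstract references could look like the sketch below: extract a hidden state for each true or false statement from a pretrained LM and fit a linear classifier on it. The model name (`gpt2`), the layer index, and the tiny hand-written fact list are placeholders, not choices made in the paper.

```python
# Illustrative linear-probe sketch: high held-out accuracy of a linear
# classifier on hidden states indicates a linearly separable truth subspace.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

statements = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of Germany.", 1),
    ("Madrid is the capital of Spain.", 1),
    ("Paris is the capital of Italy.", 0),
    ("Berlin is the capital of France.", 0),
    ("Madrid is the capital of Portugal.", 0),
]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

feats, labels = [], []
with torch.no_grad():
    for text, label in statements:
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # Use the last-token hidden state of a middle layer as the statement
        # representation; the layer index here is an arbitrary choice.
        h = out.hidden_states[6][0, -1]
        feats.append(h.numpy())
        labels.append(label)

# Linear probe; a real study would report accuracy on held-out statements,
# this toy list is far too small to be meaningful.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```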