Emergence of Linear Truth Encodings in Language Models

Abstract

Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then, over a longer horizon, learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
