Truth Neurons
May 18, 2025
Authors: Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
cs.AI
Abstract
Despite their remarkable success and deployment across diverse workflows,
language models sometimes produce untruthful responses. Our limited
understanding of how truthfulness is mechanistically encoded within these
models jeopardizes their reliability and safety. In this paper, we propose a
method for identifying representations of truthfulness at the neuron level. We
show that language models contain truth neurons, which encode truthfulness in a
subject-agnostic manner. Experiments conducted across models of varying scales
validate the existence of truth neurons, confirming that the encoding of
truthfulness at the neuron level is a property shared by many language models.
The distribution patterns of truth neurons over layers align with prior
findings on the geometry of truthfulness. Selectively suppressing the
activations of truth neurons found through the TruthfulQA dataset degrades
performance both on TruthfulQA and on other benchmarks, showing that the
truthfulness mechanisms are not tied to a specific dataset. Our results offer
novel insights into the mechanisms underlying truthfulness in language models
and highlight potential directions toward improving their trustworthiness and
reliability.
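The intervention described in the abstract, selectively suppressing the activations of identified neurons, is typically implemented with forward hooks on the MLP sublayers. Below is a minimal, hypothetical sketch under assumptions not taken from the paper: the model (`gpt2`), the `(layer, neuron)` pairs in `truth_neurons`, and the choice to hook the first MLP projection are all illustrative placeholders; the paper's actual neuron-identification procedure is not reproduced here.

```python
# Minimal sketch: zeroing selected MLP neurons in a GPT-2-style model.
# All (layer, neuron) indices below are hypothetical placeholders, not
# the neurons found in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM with a transformer.h[*].mlp stack
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical "truth neurons": layer index -> neuron indices to suppress.
truth_neurons = {10: [421, 1337], 11: [56]}

def make_suppression_hook(neuron_ids):
    # Zero out the chosen neurons in the MLP intermediate activations.
    def hook(module, inputs, output):
        output[..., neuron_ids] = 0.0
        return output
    return hook

handles = []
for layer, neuron_ids in truth_neurons.items():
    # Hook the first MLP projection (c_fc), whose output holds the
    # per-neuron intermediate values before the down-projection.
    mlp_fc = model.transformer.h[layer].mlp.c_fc
    handles.append(mlp_fc.register_forward_hook(make_suppression_hook(neuron_ids)))

prompt = "The Great Wall of China is visible from space:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # detach the hooks to restore the unmodified model
```

In an evaluation matching the abstract's setup, one would compare benchmark accuracy (e.g., on TruthfulQA and on unrelated benchmarks) with and without the hooks attached, attributing any consistent degradation to the suppressed neurons.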