
Truth Neurons

May 18, 2025
作者: Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
cs.AI

Abstract

Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.
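The suppression experiment described above (zeroing the activations of identified truth neurons and measuring the effect on benchmark performance) can be sketched minimally. The code below is an illustration, not the paper's implementation: the neuron indices, array shapes, and the `suppress_neurons` helper are all hypothetical, and a real experiment would apply this inside a model's forward pass (e.g. via a hook on an MLP layer) rather than on a standalone array.

```python
import numpy as np

def suppress_neurons(hidden, neuron_ids):
    """Zero out selected neuron activations, leaving the rest unchanged.

    hidden: activations of shape (..., hidden_dim)
    neuron_ids: indices of the (hypothetical) truth neurons to suppress
    """
    out = hidden.copy()
    out[..., neuron_ids] = 0.0
    return out

# Toy MLP activations: 2 tokens, 8 hidden units (shapes are illustrative).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 8))

# Hypothetical truth-neuron indices, as if found via TruthfulQA attribution.
truth_neurons = [1, 5]
suppressed = suppress_neurons(hidden, truth_neurons)
```

In a transformer, the same masking would typically be installed as a forward hook on the chosen layer, so that every generation step runs with the selected neurons silenced.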
