The Truthfulness Spectrum Hypothesis

February 23, 2026
Authors: Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase
cs.AI

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided in https://github.com/zfying/truth_spec.
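
The abstract's geometric claim rests on a cosine similarity computed under a non-Euclidean inner product. The sketch below illustrates one generic form of such a measure, cos_M(u, v) = uᵀMv / sqrt(uᵀMu · vᵀMv), for a positive-definite matrix M. It is an illustration only: the abstract does not define the metric, so the choice of M (for example, an activation covariance or its inverse) is an assumption here, and the function name, the 2x2 matrix, and the probe vectors are all hypothetical.

```python
import math


def inner(u, v, metric):
    """Inner product <u, v>_M = u^T M v for a square matrix M given as nested lists."""
    return sum(
        u[i] * metric[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


def mahalanobis_cosine(u, v, metric):
    """Cosine similarity of u and v under the inner product defined by `metric`.

    Assumption: `metric` is symmetric positive-definite. Which matrix the paper
    uses is not stated in the abstract.
    """
    return inner(u, v, metric) / math.sqrt(inner(u, u, metric) * inner(v, v, metric))


# Hypothetical 2-D example: two probe directions that are orthogonal
# in the ordinary Euclidean sense.
probe_a = [1.0, 0.0]
probe_b = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the ordinary cosine similarity
metric = [[2.0, 0.5], [0.5, 1.0]]     # hypothetical positive-definite metric

euclidean_sim = mahalanobis_cosine(probe_a, probe_b, identity)
weighted_sim = mahalanobis_cosine(probe_a, probe_b, metric)

print(euclidean_sim, weighted_sim)
```

With the identity matrix the measure reduces to the ordinary cosine similarity, so the two probes appear unrelated. Under the non-identity metric the same pair has a nonzero similarity, which shows how the choice of metric can change which probe directions count as aligned.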