The Truthfulness Spectrum Hypothesis

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions how general this finding is. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, which suggests a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is available at https://github.com/zfying/truth_spec.
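
To make the geometric claim concrete, the following is a minimal sketch of a Mahalanobis-style cosine similarity between two probe directions. It is an illustration under stated assumptions, not the implementation in the linked repository: the abstract does not give the exact definition, so the sketch assumes the similarity is the cosine under the inner product induced by a positive-definite metric matrix, here taken to be the inverse covariance of the activations. The function name `mahalanobis_cosine`, the synthetic activations, and the ridge term are all hypothetical.

```python
import numpy as np


def mahalanobis_cosine(u, v, metric):
    """Cosine similarity under the inner product <a, b> = a^T M b.

    `metric` (M) is assumed to be symmetric positive definite, e.g. the
    inverse covariance of model activations. This choice is an assumption;
    the paper's exact definition may differ.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    M = np.asarray(metric, dtype=float)
    uv = u @ M @ v  # cross term
    uu = u @ M @ u  # squared norm of u under M
    vv = v @ M @ v  # squared norm of v under M
    return float(uv / np.sqrt(uu * vv))


# Synthetic stand-ins for residual-stream activations and two probe weights.
rng = np.random.default_rng(0)
dim = 8
acts = rng.normal(size=(500, dim)) * np.linspace(0.5, 3.0, dim)
cov = np.cov(acts, rowvar=False)

# Small ridge term keeps the inverse well conditioned (hypothetical choice).
precision = np.linalg.inv(cov + 1e-6 * np.eye(dim))

w_a = rng.normal(size=dim)  # hypothetical probe direction for domain A
w_b = rng.normal(size=dim)  # hypothetical probe direction for domain B

sim = mahalanobis_cosine(w_a, w_b, precision)
print(f"Mahalanobis cosine similarity: {sim:.3f}")
```

With the identity matrix as the metric, this reduces to the ordinary cosine similarity, so the two quantities differ only to the extent that the activation covariance is anisotropic.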