The Truthfulness Spectrum Hypothesis

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2 = 0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is available at https://github.com/zfying/truth_spec.
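
As an illustration of the geometric measure mentioned above, the following is a minimal sketch of a Mahalanobis-style cosine similarity between two probe directions. The abstract does not give the exact definition used in the paper, so this sketch assumes the common form cos_Sigma(u, v) = u^T Sigma v / sqrt((u^T Sigma u)(v^T Sigma v)), where Sigma is taken to be a covariance matrix of the model's activations. The function names, the choice of Sigma, and the toy numbers are all hypothetical and are not taken from the released code.

```python
import math


def quad_form(u, sigma, v):
    """Return u^T Sigma v for vectors and a matrix given as plain Python lists."""
    return sum(
        u[i] * sigma[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


def mahalanobis_cosine(u, v, sigma):
    """Cosine similarity under the inner product induced by sigma.

    Assumption: sigma is a symmetric positive-definite matrix, e.g. an
    activation covariance. With sigma equal to the identity, this reduces
    to ordinary (Euclidean) cosine similarity.
    """
    numerator = quad_form(u, sigma, v)
    denominator = math.sqrt(quad_form(u, sigma, u) * quad_form(v, sigma, v))
    return numerator / denominator


# Toy 2-D example with made-up numbers (not data from the paper).
identity = [[1.0, 0.0], [0.0, 1.0]]
sigma = [[4.0, 0.0], [0.0, 1.0]]  # activations vary more along the first axis
w_a = [1.0, 0.0]                  # hypothetical probe direction for domain A
w_b = [1.0, 1.0]                  # hypothetical probe direction for domain B

euclid = mahalanobis_cosine(w_a, w_b, identity)
mahal = mahalanobis_cosine(w_a, w_b, sigma)
print(f"Euclidean cosine:   {euclid:.4f}")
print(f"Mahalanobis cosine: {mahal:.4f}")
```

Under this assumed definition, the measure equals the correlation between the two probes' outputs on zero-mean activations with covariance Sigma, which offers one plausible intuition for why it could track cross-domain generalization more closely than the plain cosine does.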