The Truthfulness Spectrum Hypothesis

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions how general this finding is. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), across sycophantic and expectation-inverted lying, and across existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2 = 0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, which suggests a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is available at https://github.com/zfying/truth_spec.
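
To make the geometric quantity mentioned above concrete, the following is a minimal sketch of a cosine similarity computed under a Mahalanobis-style inner product. The abstract does not state the exact definition, so several things here are assumptions for illustration only: the metric is taken to be the covariance of the activations (the inverse covariance is another plausible choice), the function name `mahalanobis_cosine` is hypothetical, and the activations and probe directions are random synthetic data rather than anything from the released code.

```python
import numpy as np


def mahalanobis_cosine(w1, w2, metric):
    """Cosine similarity under the inner product <a, b> = a^T metric b.

    `metric` is assumed to be symmetric positive definite. Which matrix
    is used in the actual experiments is not specified here; passing the
    activation covariance is an assumption of this sketch.
    """
    num = w1 @ metric @ w2
    den = np.sqrt((w1 @ metric @ w1) * (w2 @ metric @ w2))
    return float(num / den)


# Synthetic stand-ins for model activations and two probe directions.
rng = np.random.default_rng(0)
dim = 8
acts = rng.normal(size=(500, dim)) * np.linspace(0.5, 3.0, dim)
cov = np.cov(acts, rowvar=False)  # (dim, dim) covariance of activations

w_a = rng.normal(size=dim)  # hypothetical probe direction for domain A
w_b = rng.normal(size=dim)  # hypothetical probe direction for domain B

sim = mahalanobis_cosine(w_a, w_b, cov)
self_sim = mahalanobis_cosine(w_a, w_a, cov)

# With the identity as metric, the measure reduces to ordinary cosine similarity.
identity_sim = mahalanobis_cosine(w_a, w_b, np.eye(dim))
plain_cos = float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
```

Under this assumed definition, directions are compared after weighting each dimension by how much the activations actually vary along it, so two probes that differ only along low-variance dimensions still count as similar.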
