The Truthfulness Spectrum Hypothesis

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided in https://github.com/zfying/truth_spec.
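The abstract's geometric claim rests on a Mahalanobis cosine similarity between probe directions. The sketch below is one plausible reading of that quantity, not the authors' implementation: it assumes the metric is the covariance matrix of the activations, so that the inner product between two probe weight vectors is the covariance of their outputs. The function name `mahalanobis_cosine`, the synthetic activations, and the random probe vectors are all hypothetical, and the paper's exact definition (for example, whether it uses the covariance or its inverse) may differ.

```python
import numpy as np


def mahalanobis_cosine(w1, w2, cov):
    """Cosine between w1 and w2 under the inner product <u, v> = u^T cov v.

    Assumption: `cov` is the covariance of the activations the probes read.
    """
    num = w1 @ cov @ w2
    den = np.sqrt((w1 @ cov @ w1) * (w2 @ cov @ w2))
    return float(num / den)


rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: 2000 samples, 16 dimensions,
# mixed by a random matrix so the covariance is far from isotropic.
acts = rng.normal(size=(2000, 16)) @ rng.normal(size=(16, 16))
cov = np.cov(acts, rowvar=False)

# Two hypothetical probe directions (in practice, trained probe weights).
w_a = rng.normal(size=16)
w_b = rng.normal(size=16)

sim = mahalanobis_cosine(w_a, w_b, cov)

# Under this choice of metric, the similarity equals the Pearson correlation
# of the two probes' scores on the same activations.
corr = float(np.corrcoef(acts @ w_a, acts @ w_b)[0, 1])

# Ordinary cosine similarity for comparison (identity metric).
plain = mahalanobis_cosine(w_a, w_b, np.eye(16))

print(f"Mahalanobis cosine: {sim:.4f}")
print(f"Score correlation:  {corr:.4f}")
print(f"Plain cosine:       {plain:.4f}")
```

With the covariance as the metric, two probes are similar when their scores co-vary on real activations, rather than when their raw weight vectors happen to align. That is one reason such a measure could track cross-domain transfer more closely than plain cosine similarity.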
