Representational Stability of Truth in Large Language Models
November 24, 2025
Authors: Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
cs.AI
Abstract
Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to 40% flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes (≤ 8.2%). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
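
The abstract's probing-and-relabeling procedure can be illustrated with a minimal sketch (not the authors' code): fit a linear probe on cached LLM activations with "neither" statements labeled not-true, refit with them labeled true, and read the boundary shift as the fraction of statements whose predicted truth value flips. The synthetic activations, dimensions, and label conventions below are illustrative assumptions standing in for real hidden states.

```python
# Sketch of the probe/relabel procedure; synthetic data stands in for
# real LLM activations (one hidden-state vector per statement).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 4096                                     # hidden size (assumed)
X_true = rng.normal(0.5, 1.0, (200, d))      # activations of true statements
X_false = rng.normal(-0.5, 1.0, (200, d))    # activations of false statements
X_neither = rng.normal(0.0, 1.0, (100, d))   # "neither" statements

def fit_probe(neither_label: int) -> LogisticRegression:
    """Fit a linear probe with 'neither' statements assigned the given
    label (1 = true, 0 = not-true)."""
    X = np.vstack([X_true, X_false, X_neither])
    y = np.concatenate([np.ones(len(X_true)),
                        np.zeros(len(X_false)),
                        np.full(len(X_neither), neither_label)])
    return LogisticRegression(max_iter=2000).fit(X, y)

# Controlled label change: "neither" treated as not-true vs. as true.
probe_a = fit_probe(neither_label=0)
probe_b = fit_probe(neither_label=1)

# Boundary shift, measured as the fraction of statements whose predicted
# truth value flips between the two labelings.
X_eval = np.vstack([X_true, X_false, X_neither])
flips = np.mean(probe_a.predict(X_eval) != probe_b.predict(X_eval))
print(f"flipped truth judgements: {flips:.1%}")
```

In this framing, a low flip rate corresponds to high representational stability: the probe's notion of truth is insensitive to how the ambiguous "neither" statements are labeled.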