Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
April 14, 2026
Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov
cs.AI
Abstract
Humans use introspection to evaluate their understanding, relying on private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness: information unavailable through external observation. We train correctness classifiers on question representations drawn both from a model's own hidden states and from external models, testing whether self-representations provide a performance advantage. Under standard evaluation we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations on factual-knowledge tasks, but show no advantage on math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
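
To make the probing setup concrete, here is a minimal sketch, not the paper's code: it extracts a last-token hidden state for each question from a representation model and fits a linear correctness classifier. The label always records whether the target model answered correctly, regardless of which model supplies the representation. The model names, layer index, and the `questions`/`was_correct` data are illustrative assumptions.

```python
# Minimal correctness-probing sketch (assumptions noted above; not the
# authors' released code). Representations may come from the target
# model itself (self-probe) or from a different model (peer-probe);
# labels are always the target model's correctness.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def question_reps(model_name, questions, layer):
    """Last-token hidden state of each question at the given layer."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    reps = []
    for q in questions:
        ids = tok(q, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; index k is layer k.
        reps.append(out.hidden_states[layer][0, -1].float().numpy())
    return np.stack(reps)

# Hypothetical data: questions plus binary labels saying whether the
# target model's answer to each question was judged correct.
questions = ["What is the capital of France?", "What is 17 * 23?"]  # ...
was_correct = np.array([1, 0])  # ... (one label per question)

def probe_accuracy(rep_model, layer=16):
    """Fit and evaluate a linear probe on representations from rep_model.
    rep_model == target model -> self-probe; any other model -> peer-probe."""
    X = question_reps(rep_model, questions, layer)
    X_tr, X_te, y_tr, y_te = train_test_split(X, was_correct, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Self-probe vs. peer-probe (model names are placeholders):
# probe_accuracy("meta-llama/Llama-3.1-8B")    # self-representation
# probe_accuracy("mistralai/Mistral-7B-v0.1")  # peer representation
```

On the paper's account, the decisive comparison is run not on the full test set but on the disagreement subset: questions for which the compared models' predictions conflict. Restricting evaluation to that subset is what separates genuinely privileged self-knowledge from the signal that any model can read off the question.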