Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
April 14, 2026
Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov
cs.AI
Abstract
Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness: information unavailable through external observation. We train correctness classifiers on question representations drawn both from a model's own hidden states and from those of external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations on factual-knowledge tasks but show no advantage on math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
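
To make the probing setup concrete, the sketch below shows one way the self- vs. peer-probe comparison could be implemented. This is a minimal illustration under stated assumptions, not the paper's actual code: the array names (self_reps, peer_reps, target_correct), the linear-probe choice (logistic regression), and the split parameters are all illustrative.

```python
# Minimal sketch of the self- vs. peer-probe comparison described in the
# abstract. Assumes question representations (hidden states at some layer)
# have already been extracted into numpy arrays; all names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_probe(reps: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Fit a linear correctness probe on question representations.

    reps:   (n_questions, hidden_dim) hidden states for each question
    labels: (n_questions,) 1 if the target model answered correctly, else 0
    Returns the fitted probe and its held-out accuracy.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        reps, labels, test_size=0.2, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)


def disagreement_subset(pred_a: np.ndarray, pred_b: np.ndarray) -> np.ndarray:
    """Indices where two prediction vectors conflict.

    The abstract's disagreement subsets restrict evaluation to such cases,
    factoring out the shared consensus signal that masks privileged
    knowledge on the full test set.
    """
    return np.flatnonzero(pred_a != pred_b)


# Self-probe: the answering model's own hidden states predict its own
# correctness. Peer-probe: an external model's representations of the same
# questions predict the *same* labels; any gap is candidate privileged
# knowledge. Usage (with hypothetical precomputed arrays):
#   _, self_acc = train_probe(self_reps, target_correct)
#   _, peer_acc = train_probe(peer_reps, target_correct)
# Repeating the comparison with representations from each layer gives the
# depth-wise localization the abstract describes.
```

Comparing the two probe accuracies only on the disagreement indices is what separates genuinely model-internal signal from the consensus-driven signal that, per the abstract, dominates standard evaluation.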