Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Abstract

Humans use introspection to evaluate their own understanding through private internal states that are inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, that is, information unavailable through external observation. We train correctness classifiers on question representations drawn from a model's own hidden states and from external models, and test whether self-representations provide a performance advantage. Under standard evaluation we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize that this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. There we find domain-specific privileged knowledge: self-representations consistently outperform peer representations on factual knowledge tasks but show no advantage on math reasoning. We further localize this domain asymmetry across model layers: the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
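The effect of restricting evaluation to a disagreement subset can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the reading of "disagreement" as questions where exactly one of two models answers correctly, and the toy labels and probe predictions are all assumptions made for illustration. Both probes are scored on how well they predict the target model's correctness.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if not labels:
        raise ValueError("empty evaluation set")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def disagreement_indices(correct_a: Sequence[int], correct_b: Sequence[int]) -> List[int]:
    """Indices where exactly one of the two models answers correctly (assumed definition)."""
    return [i for i, (a, b) in enumerate(zip(correct_a, correct_b)) if a != b]


def compare_probes(
    self_preds: Sequence[int],
    peer_preds: Sequence[int],
    correct_a: Sequence[int],
    correct_b: Sequence[int],
) -> Dict[str, Tuple[float, float]]:
    """Return (self-probe, peer-probe) accuracy at predicting model A's
    correctness, on the full set and on the disagreement subset."""
    idx = disagreement_indices(correct_a, correct_b)

    def pick(xs: Sequence[int]) -> List[int]:
        return [xs[i] for i in idx]

    return {
        "full": (accuracy(self_preds, correct_a), accuracy(peer_preds, correct_a)),
        "disagreement": (
            accuracy(pick(self_preds), pick(correct_a)),
            accuracy(pick(peer_preds), pick(correct_a)),
        ),
    }


# Toy data, invented for illustration: 1 = answered correctly, 0 = not.
correct_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # target model A
correct_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # peer model B, differs on the last two
self_preds = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # hypothetical probe on A's own states
peer_preds = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # hypothetical probe on B's states

print(compare_probes(self_preds, peer_preds, correct_a, correct_b))
# {'full': (0.8, 0.8), 'disagreement': (1.0, 0.0)}
```

In this toy case the two probes tie on the full set because most questions are ones where both models agree; the gap only becomes visible on the two questions where the models' correctness differs.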
