Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Abstract

Humans use introspection to evaluate their own understanding, drawing on private internal states that are inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, that is, information unavailable through external observation. We train correctness classifiers on question representations taken from a model's own hidden states and from external models, and test whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize that this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations on factual knowledge tasks, but show no advantage on math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
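The role of the disagreement subset can be illustrated with a minimal sketch. It assumes that "disagreement" means the target model and a peer model differ in whether they answered a question correctly; the abstract does not spell out the exact definition, so this reading is an assumption. The function names and all label and prediction values are hypothetical, hand-picked toy data chosen to make the masking effect visible. They are not results from the paper.

```python
from typing import List, Optional, Sequence


def disagreement_indices(correct_a: Sequence[int], correct_b: Sequence[int]) -> List[int]:
    """Return indices of questions where exactly one of the two models is correct."""
    if len(correct_a) != len(correct_b):
        raise ValueError("label sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(correct_a, correct_b)) if a != b]


def accuracy(preds: Sequence[int], labels: Sequence[int],
             indices: Optional[Sequence[int]] = None) -> float:
    """Fraction of matching predictions, optionally restricted to a subset of indices."""
    idx = list(range(len(labels))) if indices is None else list(indices)
    if not idx:
        raise ValueError("cannot compute accuracy on an empty subset")
    return sum(preds[i] == labels[i] for i in idx) / len(idx)


# Toy correctness labels (1 = answered correctly). Hypothetical values.
target_correct = [1, 1, 1, 0, 0, 1, 0, 1]  # the model whose correctness is being predicted
peer_correct = [1, 1, 1, 0, 0, 0, 1, 1]    # a different model on the same questions

# Toy probe outputs, both trying to predict target_correct. Hypothetical values.
self_probe_preds = [1, 0, 0, 0, 0, 1, 0, 1]  # probe on the target model's own hidden states
peer_probe_preds = [1, 1, 1, 0, 0, 0, 1, 1]  # probe on the peer model's hidden states

subset = disagreement_indices(target_correct, peer_correct)

print("full set  :", accuracy(self_probe_preds, target_correct),
      accuracy(peer_probe_preds, target_correct))
print("disagree  :", accuracy(self_probe_preds, target_correct, subset),
      accuracy(peer_probe_preds, target_correct, subset))
```

In this toy setup both probes reach the same accuracy on the full set, because the two models agree on most questions and a peer probe can score well by tracking shared difficulty. Restricting evaluation to the questions where the models disagree removes that shared signal, which is where any self-specific advantage would have to show up.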
