Inside-Out: Hidden Factual Knowledge in LLMs

Abstract

This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
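
The pairwise definition of knowledge above can be illustrated with a minimal sketch. It assumes each candidate answer already has a scalar score and a correctness label. The function name `knowledge_score`, the strict-inequality handling of ties, and all numbers are assumptions made for illustration and are not taken from the paper. The "internal" scores stand in for any scorer built on the model's intermediate computations (for example, a classifier over hidden states, which is itself an assumption here).

```python
from itertools import product
from typing import Sequence


def knowledge_score(scores: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Fraction of (correct, incorrect) answer pairs in which the correct
    answer receives the strictly higher score (ties count as misses)."""
    if len(scores) != len(is_correct):
        raise ValueError("scores and is_correct must have the same length")
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for c, i in product(correct, incorrect) if c > i)
    return wins / (len(correct) * len(incorrect))


# Hypothetical candidate answers for a single question (made-up numbers).
is_correct = [True, False, False, True]
external_scores = [0.10, 0.40, 0.05, 0.30]  # stand-in for token-level probabilities
internal_scores = [0.90, 0.20, 0.10, 0.80]  # stand-in for scores from intermediate computations

k_external = knowledge_score(external_scores, is_correct)
k_internal = knowledge_score(internal_scores, is_correct)

# Hidden knowledge, in the sense of the abstract, shows up as a positive gap.
hidden_gap = k_internal - k_external
print(f"external={k_external:.2f} internal={k_internal:.2f} gap={hidden_gap:.2f}")
```

In this toy case the external scorer ranks the correct answer higher in 2 of 4 pairs and the internal scorer in all 4, so the gap is 0.5. Averaging such per-question scores over a dataset would give an aggregate measure, though the exact aggregation used in the paper is not stated in the abstract.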
