Challenges with unsupervised LLM knowledge discovery

Abstract

We show that existing unsupervised methods applied to large language model (LLM) activations do not discover knowledge; instead, they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover it. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods produce classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
