Challenges with unsupervised LLM knowledge discovery

Abstract

We show that existing unsupervised methods applied to large language model (LLM) activations do not discover knowledge; instead, they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge-elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
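
The theoretical claim above can be illustrated with a small numerical sketch. It assumes the standard contrast-consistent search objective from Burns et al. (a consistency term plus a confidence term on the probe outputs for each contrast pair) and skips the activations entirely by constructing probe outputs directly. The names `ccs_loss`, `ideal_probe`, `truth`, and `distractor` are hypothetical and chosen only for this illustration; this is not the authors' experimental code.

```python
import numpy as np


def ccs_loss(p_pos, p_neg):
    """Mean CCS objective: consistency term plus confidence term.

    p_pos / p_neg are probe outputs in [0, 1] for the positive and
    negative half of each contrast pair.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


def ideal_probe(feature):
    """Hypothetical probe that outputs a binary feature on the positive
    prompt and its negation on the negative prompt."""
    p_pos = feature.astype(float)
    p_neg = 1.0 - p_pos
    return p_pos, p_neg


rng = np.random.default_rng(0)
n = 1000
truth = rng.integers(0, 2, size=n)       # stand-in for "knowledge" labels
distractor = rng.integers(0, 2, size=n)  # stand-in for an unrelated feature

loss_truth = ccs_loss(*ideal_probe(truth))
loss_distractor = ccs_loss(*ideal_probe(distractor))
loss_uninformative = ccs_loss(np.full(n, 0.5), np.full(n, 0.5))

print(loss_truth, loss_distractor, loss_uninformative)
```

Under these assumptions, the probe that tracks the unrelated binary feature reaches the same minimal loss as the probe that tracks the truth labels, while an uninformative probe is penalised by the confidence term. The objective alone therefore cannot tell the two features apart, which is the identification issue the abstract describes.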

