Challenges with unsupervised LLM knowledge discovery
December 15, 2023
Authors: Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah
cs.AI
Abstract
We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge; instead, they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge-elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
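
For orientation, contrast-consistent search (CCS) fits a probe on paired activations of a statement and its negation so that the two predicted truth probabilities are consistent (summing to one) and confident (not both near 0.5). The sketch below is a minimal illustration of that loss under stated assumptions: it uses PyTorch, a linear probe, and stand-in random data; the names and training details are illustrative, not the authors' implementation.

```python
# Minimal sketch of the CCS objective (after Burns et al., arXiv:2212.03827).
# Assumptions: linear probe, random stand-in activations, no per-pair
# normalisation; this is illustrative, not the paper's implementation.
import torch

def ccs_loss(probe, acts_pos, acts_neg):
    """Consistency term + confidence term over contrast-pair activations."""
    p_pos = torch.sigmoid(probe(acts_pos))         # P(statement is true)
    p_neg = torch.sigmoid(probe(acts_neg))         # P(negation is true)
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # the two should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # penalise p_pos = p_neg = 0.5
    return (consistency + confidence).mean()

d = 512                                            # activation dimension (assumed)
probe = torch.nn.Linear(d, 1)                      # linear probe on activations
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts_pos, acts_neg = torch.randn(100, d), torch.randn(100, d)  # stand-in data
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe, acts_pos, acts_neg)
    loss.backward()
    opt.step()
```

Note that the loss constrains only the probe's outputs on contrast pairs, which is why, as the paper proves, any sufficiently prominent binary feature of the activations (not just knowledge) can minimise it.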