
Challenges with unsupervised LLM knowledge discovery

December 15, 2023
Authors: Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah
cs.AI

Abstract

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks for evaluating future knowledge-elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
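For reference, the consistency structure exploited by contrast-consistent search (CCS) is a simple probe-training objective over contrast pairs. The sketch below is a minimal illustration of the published loss from Burns et al. (arXiv:2212.03827), not the authors' code; the tensors `acts_pos` and `acts_neg` are hypothetical placeholder activations for the affirmative and negated halves of each contrast pair.

```python
import torch
import torch.nn as nn

def ccs_loss(probe: nn.Module,
             acts_pos: torch.Tensor,
             acts_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective as published in Burns et al. (arXiv:2212.03827)."""
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)  # P(true | statement)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)  # P(true | negation)
    # Consistency: a statement and its negation should have
    # probabilities summing to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p = 0.5 everywhere.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Hypothetical usage: train a linear probe on contrast-pair activations.
hidden_dim = 768                         # assumed model width
probe = nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

acts_pos = torch.randn(128, hidden_dim)  # placeholder activations
acts_neg = torch.randn(128, hidden_dim)

for _ in range(1000):
    opt.zero_grad()
    ccs_loss(probe, acts_pos, acts_neg).backward()
    opt.step()
```

Note that nothing in this objective refers to knowledge specifically: any binary feature that flips consistently between the two halves of a contrast pair satisfies it, which is the identification problem the paper formalises.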