Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
October 21, 2024
作者: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
cs.AI
Abstract
Large language models (LLMs) can store a significant amount of factual
knowledge in their parameters. However, their parametric knowledge may conflict
with the information provided in the context -- this phenomenon, known as
context-memory knowledge conflicts, can lead to undesirable model
behaviour, such as reliance on outdated or incorrect information. Analysing the
internal activations of LLMs, we find that they can internally register the
signals of knowledge conflict at mid-layers. Such signals allow us to detect
whether a knowledge conflict occurs and use inference-time intervention
strategies to resolve it. In this work, we propose SpARE, a
training-free representation engineering method that uses pre-trained
sparse auto-encoders (SAEs) to control the knowledge selection behaviour of
LLMs. SpARE identifies the functional features that control the
knowledge selection behaviours and applies them to edit the internal
activations of LLMs at inference time. Our experimental results show that
SpARE can effectively control the usage of either knowledge source to
resolve knowledge conflict in open-domain question-answering tasks, surpassing
existing representation engineering methods (+10%) as well as contrastive
decoding methods (+15%).
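
The abstract describes the core mechanism only at a high level: encode a hidden activation with a pre-trained SAE, adjust the sparse features identified as controlling knowledge selection, and write the decoded change back into the residual stream at inference time. A minimal sketch of that idea follows; the weight names, the feature-scaling rule, and the delta-patching step are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of SAE-based activation steering (not the paper's
# exact method). Assumes a pre-trained sparse auto-encoder with weights
# (W_enc, b_enc, W_dec, b_dec) and a set of feature indices already
# identified as controlling knowledge selection behaviour.
import torch

def sae_encode(h, W_enc, b_enc):
    # Map a residual-stream activation h to sparse feature activations.
    return torch.relu(h @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    # Reconstruct the activation from the sparse feature activations.
    return z @ W_dec + b_dec

def steer_activation(h, W_enc, b_enc, W_dec, b_dec, feature_ids, scale=5.0):
    """Amplify the selected SAE features and patch the resulting change
    back into the activation (hypothetical steering rule)."""
    z = sae_encode(h, W_enc, b_enc)
    z_edited = z.clone()
    z_edited[..., feature_ids] = z[..., feature_ids] * scale  # boost target features
    # Apply only the delta between reconstructions, so the SAE's
    # reconstruction error in h is left untouched.
    return h + sae_decode(z_edited, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)

# Toy usage with random weights standing in for a pre-trained SAE.
d_model, d_sae = 64, 512
W_enc, b_enc = torch.randn(d_model, d_sae), torch.zeros(d_sae)
W_dec, b_dec = torch.randn(d_sae, d_model), torch.zeros(d_model)
h = torch.randn(1, d_model)  # one mid-layer activation
h_steered = steer_activation(h, W_enc, b_enc, W_dec, b_dec, feature_ids=[3, 17])
```

In practice such an edit would be applied via a forward hook at the mid-layers where the abstract says conflict signals are registered; scaling contextual-knowledge features up (or parametric-knowledge features up) would push the model toward one knowledge source or the other.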