Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
October 21, 2024
作者: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
cs.AI
Abstract
Large language models (LLMs) can store a significant amount of factual
knowledge in their parameters. However, their parametric knowledge may conflict
with the information provided in the context -- this phenomenon, known as
context-memory knowledge conflicts, can lead to undesirable model
behaviour, such as reliance on outdated or incorrect information. Analysing the
internal activations of LLMs, we find that they can internally register the
signals of knowledge conflict at mid-layers. Such signals allow us to detect
whether a knowledge conflict occurs and use inference-time intervention
strategies to resolve it. In this work, we propose SpARE, a
training-free representation engineering method that uses pre-trained
sparse auto-encoders (SAEs) to control the knowledge selection behaviour of
LLMs. SpARE identifies the functional features that control the
knowledge selection behaviours and applies them to edit the internal
activations of LLMs at inference time. Our experimental results show that
SpARE can effectively control the usage of either knowledge source to
resolve knowledge conflict in open-domain question-answering tasks, surpassing
existing representation engineering methods (+10%) as well as contrastive
decoding methods (+15%).
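
The abstract describes the core mechanism only at a high level: encode a hidden activation with a pre-trained SAE, adjust the sparse features identified as controlling knowledge selection, and write the decoded change back into the residual stream at inference time. A minimal sketch of that idea follows; the weight names, the feature-scaling rule, and the delta-patching step are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of SAE-based activation steering (not the paper's
# exact method). Assumes a pre-trained sparse auto-encoder with weights
# (W_enc, b_enc, W_dec, b_dec) and a set of feature indices already
# identified as controlling knowledge selection behaviour.
import torch

def sae_encode(h, W_enc, b_enc):
    # Map a residual-stream activation h to sparse feature activations.
    return torch.relu(h @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    # Reconstruct the activation from the sparse feature activations.
    return z @ W_dec + b_dec

def steer_activation(h, W_enc, b_enc, W_dec, b_dec, feature_ids, scale=5.0):
    """Amplify the selected SAE features and patch the resulting change
    back into the activation (hypothetical steering rule)."""
    z = sae_encode(h, W_enc, b_enc)
    z_edited = z.clone()
    z_edited[..., feature_ids] = z[..., feature_ids] * scale  # boost target features
    # Apply only the delta between reconstructions, so the SAE's
    # reconstruction error in h is left untouched.
    return h + sae_decode(z_edited, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)

# Toy usage with random weights standing in for a pre-trained SAE.
d_model, d_sae = 64, 512
W_enc, b_enc = torch.randn(d_model, d_sae), torch.zeros(d_sae)
W_dec, b_dec = torch.randn(d_sae, d_model), torch.zeros(d_model)
h = torch.randn(1, d_model)  # one mid-layer activation
h_steered = steer_activation(h, W_enc, b_enc, W_dec, b_dec, feature_ids=[3, 17])
```

In practice such an edit would be applied via a forward hook at the mid-layers where the abstract says conflict signals are registered; scaling contextual-knowledge features up (or parametric-knowledge features up) would push the model toward one knowledge source or the other.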