

Steering Llama 2 via Contrastive Activation Addition

December 9, 2023
Authors: Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
cs.AI

Abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying activations during their forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insights into CAA's mechanisms. CAA both steers model outputs accurately and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
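For concreteness, here is a minimal PyTorch sketch of the two CAA steps the abstract describes: averaging residual-stream activation differences over contrastive prompt pairs, then adding the resulting vector (scaled by a coefficient) at all token positions after the user's prompt. It assumes a HuggingFace Llama 2 Chat checkpoint; the layer index `LAYER`, the use of the final-token activation, and all helper names are illustrative assumptions, not the authors' released code.

```python
# Sketch of Contrastive Activation Addition (CAA); layer choice and
# data format are hypothetical, chosen only to illustrate the method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
LAYER = 13                                    # hypothetical steering layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def residual_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation after decoder layer LAYER at the final
    token of `prompt` (a simplification of the paper's token choice)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embeddings, so index LAYER + 1 is the
    # output of decoder layer LAYER; shape (batch, seq_len, d_model).
    return out.hidden_states[LAYER + 1][0, -1, :]


def compute_steering_vector(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Mean activation difference over (positive, negative) prompt pairs."""
    diffs = [residual_at_layer(pos) - residual_at_layer(neg)
             for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)


def add_steering_hook(vector: torch.Tensor, coeff: float, prompt_len: int):
    """Register a hook that adds coeff * vector to the residual stream at
    every token position after the user's prompt."""
    def hook(module, args, output):
        hidden = output[0]  # decoder layers return a tuple
        if hidden.shape[1] >= prompt_len:
            # Full prompt pass: steer only positions after the prompt
            # (an empty slice on the prompt itself).
            hidden[:, prompt_len:, :] += coeff * vector.to(hidden.dtype)
        else:
            # Cached incremental decoding: each new token is post-prompt.
            hidden += coeff * vector.to(hidden.dtype)
        return output
    return model.model.layers[LAYER].register_forward_hook(hook)
```

A usage example under the same assumptions: build the vector from a handful of contrastive pairs, then steer an open-ended generation with a positive coefficient (a negative coefficient would suppress the behavior instead).

```python
pairs = [("The capital of France is Paris.",      # behavior-positive example
          "The capital of France is Berlin.")]    # behavior-negative example
vec = compute_steering_vector(pairs)

inputs = tokenizer("Tell me about the capital of France.", return_tensors="pt")
handle = add_steering_hook(vec, coeff=2.0, prompt_len=inputs.input_ids.shape[1])
output_ids = model.generate(**inputs, max_new_tokens=50)
handle.remove()  # stop steering once generation is done
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```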