Steering Llama 2 via Contrastive Activation Addition
December 9, 2023
Authors: Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
cs.AI
Abstract
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying activations during their forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insight into CAA's mechanisms. CAA both accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
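The two core operations described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the use of NumPy arrays in place of real model activations, and the shapes involved are all assumptions made for clarity.

```python
import numpy as np

def compute_steering_vector(pos_acts, neg_acts):
    """Average the difference in residual-stream activations between
    positive and negative examples of a behavior at a single layer.

    pos_acts, neg_acts: lists of (d_model,) arrays, one per contrastive pair.
    Returns a (d_model,) steering vector.
    """
    pos = np.stack(pos_acts)  # (n_pairs, d_model)
    neg = np.stack(neg_acts)  # (n_pairs, d_model)
    return (pos - neg).mean(axis=0)

def apply_steering(resid_stream, steering_vec, coeff, prompt_len):
    """Add coeff * steering_vec at every token position after the
    user's prompt (a positive coeff promotes the behavior, a negative
    coeff suppresses it).

    resid_stream: (seq_len, d_model) residual-stream activations.
    prompt_len: number of prompt tokens to leave unmodified.
    """
    out = resid_stream.copy()
    out[prompt_len:] += coeff * steering_vec
    return out
```

In practice the activations would be captured from a chosen layer of Llama 2 Chat (e.g. via forward hooks) rather than passed around as NumPy arrays, and the addition would happen inside the forward pass during generation.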