Steering Llama 2 via Contrastive Activation Addition

Abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during the forward pass. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or a negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods such as finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insights into CAA's mechanisms. CAA both steers model outputs accurately and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
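The two steps named in the abstract (averaging activation differences over contrast pairs, then adding the scaled result after the prompt) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes residual stream activations at a single layer have already been extracted into arrays, and the function names `steering_vector` and `apply_steering` are hypothetical.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Mean difference between paired positive and negative activations.

    Both inputs are assumed to have shape (n_pairs, d_model), with row i of
    each array coming from the same contrast pair.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    return (pos_acts - neg_acts).mean(axis=0)


def apply_steering(resid, vector, prompt_len, coeff):
    """Add coeff * vector to every token position after the prompt.

    resid is assumed to have shape (seq_len, d_model); positions with index
    >= prompt_len are treated as coming after the user's prompt.
    """
    steered = np.array(resid, dtype=float, copy=True)
    steered[prompt_len:] += coeff * vector  # broadcasts over positions
    return steered


# Toy data: two contrast pairs in a 2-dimensional "residual stream".
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 0.0]])
# A negative coefficient pushes away from the positive behavior.
steered = apply_steering(np.zeros((4, 2)), vec, prompt_len=2, coeff=-2.0)
```

In a real model the addition would happen inside the forward pass at a chosen layer, for example through a hook on that layer's output; the sign and magnitude of `coeff` correspond to the positive or negative coefficient mentioned in the abstract.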
