Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Abstract

Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying the training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
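The projection step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which layers are affected, how the concept directions are obtained, or how the projection is wired into training. The function names `orthonormalize` and `ablate` are hypothetical, and the example assumes a single activation vector and a list of concept directions given as plain Python lists. It shows only the linear algebra: removing the component of an activation that lies in the span of the undesired directions.

```python
import math


def orthonormalize(directions, eps=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of `directions`.

    Hypothetical helper; concept directions need not be orthogonal,
    so they are orthonormalized before projecting.
    """
    basis = []
    for d in directions:
        v = list(d)
        # Remove the components along the basis vectors found so far.
        for b in basis:
            coeff = sum(x * y for x, y in zip(v, b))
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > eps:  # skip directions already inside the span
            basis.append([x / norm for x in v])
    return basis


def ablate(activation, basis):
    """Project `activation` onto the orthogonal complement of `basis`.

    Assumes `basis` is orthonormal (e.g. the output of `orthonormalize`).
    """
    out = list(activation)
    for b in basis:
        coeff = sum(x * y for x, y in zip(out, b))
        out = [x - coeff * y for x, y in zip(out, b)]
    return out


# Toy example: two non-orthogonal "concept" directions in a 3-d space.
concept_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(concept_directions)
ablated = ablate([3.0, 4.0, 5.0], basis)
```

In this toy case the two directions span the first two coordinates, so only the third coordinate of the activation survives. In a fine-tuning setting one would presumably apply the same projection to hidden activations on every forward pass, so that the model cannot rely on the ablated concepts while fitting the training data; that wiring is an assumption here and is not shown.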
