Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
July 22, 2025
Authors: Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
cs.AI
Abstract
Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
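
The abstract describes the core mechanism only at a high level: project the undesired concept directions out of the model's hidden states while fine-tuning proceeds as usual. Below is a minimal sketch of what that ablation step could look like, assuming a PyTorch/Transformers-style model; the hook-based wiring, the layer range, and all variable names (make_ablation_hook, concept_dirs) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: remove the subspace spanned by a set of concept directions
# from a layer's hidden states via a linear projection, applied during
# fine-tuning. Assumes a PyTorch model with residual-stream hidden states.
import torch

def make_ablation_hook(directions: torch.Tensor):
    """directions: (k, d) matrix of concept directions in the latent space."""
    # Orthonormalize so the projection removes exactly the spanned subspace.
    q, _ = torch.linalg.qr(directions.T)  # q: (d, k) with orthonormal columns

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d)
        basis = q.to(dtype=hidden.dtype, device=hidden.device)
        coeffs = hidden @ basis              # (batch, seq, k) components along concepts
        ablated = hidden - coeffs @ basis.T  # project those components out
        if isinstance(output, tuple):
            return (ablated,) + output[1:]
        return ablated

    return hook

# Usage sketch (hypothetical layer range): register the hook on selected
# layers, then run standard fine-tuning. Because the projection is applied
# in the forward pass, gradients flow through it and the model cannot use
# the ablated directions to fit the training data.
# for layer in model.model.layers[10:20]:
#     layer.register_forward_hook(make_ablation_hook(concept_dirs))
```

The design choice worth noting is that the ablation is applied inside the forward pass rather than as a post-hoc edit, so the optimization itself is steered away from solutions that rely on the undesired concepts.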