Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
July 22, 2025
Authors: Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
cs.AI
Abstract
Fine-tuning large language models (LLMs) can lead to unintended
out-of-distribution generalization. Standard approaches to this problem rely on
modifying training data, for example by adding data that better specify the
intended generalization. However, this is not always practical. We introduce
Concept Ablation Fine-Tuning (CAFT), a technique that leverages
interpretability tools to control how LLMs generalize from fine-tuning, without
needing to modify the training data or otherwise use data from the target
distribution. Given a set of directions in an LLM's latent space corresponding
to undesired concepts, CAFT works by ablating these concepts with linear
projections during fine-tuning, steering the model away from unintended
generalizations. We successfully apply CAFT to three fine-tuning tasks,
including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow
task generalize to give egregiously misaligned responses to general questions.
Without any changes to the fine-tuning data, CAFT reduces misaligned responses
by 10x without degrading performance on the training distribution. Overall,
CAFT represents a novel approach for steering LLM generalization without
modifying training data.
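
To make the core idea concrete, below is a minimal sketch of concept ablation during fine-tuning, assuming PyTorch and a Hugging Face-style decoder model. The hook-based mechanics and all names (`make_ablation_hook`, `model`, `concept_directions`) are illustrative assumptions, not the authors' exact implementation; the sketch only shows how a hidden state can be linearly projected away from an undesired concept direction on every forward pass while ordinary fine-tuning proceeds.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects out `direction` from hidden states."""
    d = direction / direction.norm()  # unit vector for the undesired concept

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Linear projection: remove the component of each hidden state
        # that lies along the concept direction.
        coeff = hidden @ d                          # (batch, seq)
        hidden = hidden - coeff.unsqueeze(-1) * d   # (batch, seq, d_model)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Hypothetical usage during fine-tuning, where `concept_directions` maps a
# layer index to a residual-stream direction for an undesired concept:
#
# handles = [
#     model.model.layers[i].register_forward_hook(make_ablation_hook(v))
#     for i, v in concept_directions.items()
# ]
# ... run the standard fine-tuning loop; because the ablation is applied in
# the forward pass, gradients flow through the projected activations ...
# for h in handles:
#     h.remove()  # restore normal behavior after training
```

Because the projection is applied only during training, the fine-tuned weights are used normally at inference time; the ablation simply prevents gradient updates from exploiting the removed concept directions.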