Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Abstract

Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
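
The linear-projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a latent activation is a plain vector and that the undesired concept directions are already given; the function names (`orthonormalize`, `ablate`) and the toy numbers are hypothetical and are not taken from the paper. The sketch shows only the projection itself, not how or where it is applied inside a model during fine-tuning.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `directions`.

    Directions that are (numerically) linear combinations of earlier
    ones are dropped.
    """
    basis = []
    for d in directions:
        w = list(d)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def ablate(activation, basis):
    """Project `activation` onto the orthogonal complement of `basis`.

    `basis` must be orthonormal, so subtracting each component in turn
    equals applying the projection (I - Q^T Q).
    """
    out = list(activation)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Toy example (assumed values): two concept directions spanning the
# first two coordinates of a 3-dimensional latent space.
concept_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(concept_directions)
result = ablate([3.0, 4.0, 5.0], basis)  # only the third coordinate survives
```

After ablation, the vector has no component along any of the given directions, so anything computed downstream cannot depend on them linearly. Orthonormalizing first matters when the directions are not mutually orthogonal: subtracting raw, correlated directions one by one would leave a residual component in their span.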
