CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

Abstract

A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change a targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as a constrained optimization problem and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing the capability constraint via a Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly, even when the base model has not been trained to convergence. We make this second-order procedure efficient at LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
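To make the idea of a matrix-free, Kronecker-structured projection concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `low_curvature_project`, the `keep_fraction` parameter, and the quantile-based threshold are all hypothetical choices made for this example. It assumes the layer's curvature is approximated K-FAC-style as a Kronecker product of an output-side factor `G` and an input-side factor `A`, so the eigenvalues of the full curvature are products of the factors' eigenvalues and the full projection matrix never needs to be formed.

```python
import numpy as np


def low_curvature_project(delta_w, A, G, keep_fraction=0.5):
    """Project an update delta_w (out x in) onto low-curvature directions.

    Assumption (illustrative): curvature ~ kron(G, A) on the row-major
    flattening of the weight matrix, with A (in x in) and G (out x out)
    symmetric positive semi-definite.
    """
    lam_a, U_a = np.linalg.eigh(A)  # eigenpairs of the input-side factor
    lam_g, U_g = np.linalg.eigh(G)  # eigenpairs of the output-side factor

    # Eigenvalues of the Kronecker product are pairwise products.
    curvature = np.outer(lam_g, lam_a)  # shape (out, in)

    # Hypothetical rule: keep the keep_fraction lowest-curvature directions.
    threshold = np.quantile(curvature, keep_fraction)
    mask = curvature <= threshold

    # Rotate into the Kronecker eigenbasis, zero high-curvature
    # coefficients, and rotate back. No (out*in) x (out*in) matrix is built.
    coeffs = U_g.T @ delta_w @ U_a
    return U_g @ (coeffs * mask) @ U_a.T


# Toy data standing in for layer activations and backpropagated gradients.
rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 32
acts = rng.normal(size=(n, d_in))
grads = rng.normal(size=(n, d_out))
A = acts.T @ acts / n
G = grads.T @ grads / n

delta_w = rng.normal(size=(d_out, d_in))
projected = low_curvature_project(delta_w, A, G)
```

The cost here is two small eigendecompositions plus a few matrix products of layer size, whereas an explicit projector would be a square matrix with `d_out * d_in` rows. How the threshold is actually chosen, and how curvature factors are estimated, are details the abstract does not specify.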
