AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
August 4, 2025
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
cs.AI
Abstract
Low-rank adaptation (LoRA) has become a standard tool for efficiently
fine-tuning large language models (LLMs). Yet, even minor LoRA updates can
induce alignment drift, weakening safety and behavioral constraints through
entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL),
a principled framework for preserving alignment during fine-tuning. AGL
introduces several key components: a primary task loss for supervision, Fisher
Information Matrix-based regularization to restrict updates in
alignment-sensitive subspaces, and task-specific regularization to stabilize
the integration of new knowledge. We further introduce collision-aware
regularization, blending Riemannian overlap -- which penalizes coordinate-wise
interference -- and geodesic separation -- which encourages disjoint update
geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and
unsafe prompts designed to quantify alignment drift and safety degradation.
Empirical evaluations show that AGL mitigates alignment drift by up to 50% on
safety-critical benchmarks without degrading downstream task performance.
Comprehensive ablation confirms that each component contributes distinctly to
preserving latent safety behaviors. Finally, we derive and validate a scaling
law for catastrophic forgetting, revealing that AGL flattens post-finetuning
loss escalation while preserving adaptation dynamics. AGL is a structurally
grounded refinement of LoRA, ensuring alignment preservation with minimal
trade-offs. To encourage further exploration and development, we open-source
our implementation.
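The abstract describes a composite objective: a task loss plus a Fisher-weighted penalty on alignment-sensitive directions, a task-specific regularizer, and collision-aware terms (Riemannian overlap and geodesic separation). The sketch below is a minimal, hypothetical NumPy rendering of that structure, not the authors' implementation: the function name, the split of the LoRA update into `delta_w_align` / `delta_w_task`, the diagonal Fisher estimate, and all coefficient names are illustrative assumptions.

```python
import numpy as np

def agl_loss_sketch(task_loss, delta_w_align, delta_w_task, fisher_diag,
                    lam_f=1.0, lam_t=0.1, lam_c=0.5):
    """Hypothetical AGL-style composite objective (illustrative only).

    delta_w_align / delta_w_task: the LoRA update split into an
    alignment-sensitive component and a task component (assumed to come
    from a Fisher-guided decomposition). fisher_diag: a diagonal Fisher
    Information estimate over the alignment-sensitive directions.
    """
    # Fisher-weighted penalty: discourage movement along
    # alignment-sensitive (high-Fisher) coordinates.
    fisher_pen = np.sum(fisher_diag * delta_w_align ** 2)

    # Task-specific regularization: stabilize the magnitude of the
    # new-knowledge update.
    task_pen = np.sum(delta_w_task ** 2)

    # Collision-aware term, Riemannian-overlap flavor: penalize
    # coordinate-wise interference between the two components.
    overlap = np.sum(np.abs(delta_w_align * delta_w_task))

    # Geodesic-separation flavor: penalize small angles between the
    # flattened update directions (0 when antipodal, 1 when parallel).
    cos = np.dot(delta_w_align.ravel(), delta_w_task.ravel()) / (
        np.linalg.norm(delta_w_align) * np.linalg.norm(delta_w_task) + 1e-12)
    geodesic = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    return (task_loss + lam_f * fisher_pen + lam_t * task_pen
            + lam_c * (overlap + geodesic))
```

Under this sketch, an update whose task component is orthogonal to the alignment-sensitive component incurs a lower collision penalty than a parallel one, matching the abstract's goal of encouraging disjoint update geometry.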