

AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization

August 4, 2025
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
cs.AI

Abstract

Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during fine-tuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap -- which penalizes coordinate-wise interference -- and geodesic separation -- which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablation studies confirm that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-fine-tuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
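The abstract names the penalty terms that make up the AGL objective but not their exact functional forms. As a rough illustration of how such a composite regularizer could be assembled, the sketch below combines a diagonal-Fisher penalty on alignment-sensitive coordinates, a task-update regularizer, a coordinate-wise overlap term, and an angular (geodesic) separation term. All function names, penalty forms, and coefficients here are assumptions for illustration, not the paper's implementation.

```python
import math

def agl_penalty(delta_w, fisher_diag, delta_align, delta_task,
                lam_fisher=1.0, lam_task=0.1, lam_collision=0.5):
    """Illustrative sketch of AGL-style penalty terms (forms are assumed).

    delta_w     : flattened LoRA update
    fisher_diag : diagonal Fisher estimate; large entries mark
                  alignment-sensitive coordinates
    delta_align : update component in the alignment-sensitive subspace
    delta_task  : update component carrying the new task knowledge
    """
    # Fisher-guided term: quadratically penalize movement along
    # coordinates with high Fisher information (alignment-sensitive).
    fisher_reg = lam_fisher * sum(f * d * d for f, d in zip(fisher_diag, delta_w))

    # Task-specific term: keep the magnitude of the task update bounded.
    task_reg = lam_task * sum(d * d for d in delta_task)

    # "Riemannian overlap"-style term: coordinate-wise interference
    # between the alignment and task components.
    overlap = sum(abs(a * t) for a, t in zip(delta_align, delta_task))

    # "Geodesic separation"-style term: penalty shrinks as the angle
    # between the two update directions grows toward pi.
    dot = sum(a * t for a, t in zip(delta_align, delta_task))
    norm = (math.sqrt(sum(a * a for a in delta_align))
            * math.sqrt(sum(t * t for t in delta_task)) + 1e-12)
    angle = math.acos(max(-1.0, min(1.0, dot / norm)))
    geodesic = 1.0 - angle / math.pi

    collision_reg = lam_collision * (overlap + geodesic)
    return fisher_reg + task_reg + collision_reg
```

Under this toy formulation, orthogonal alignment and task updates incur a smaller collision penalty than parallel ones, which matches the abstract's stated goal of encouraging disjoint update geometry.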
PDF · August 6, 2025