

AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization

August 4, 2025
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
cs.AI

Abstract

Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during fine-tuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap -- which penalizes coordinate-wise interference -- and geodesic separation -- which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablation studies confirm that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-fine-tuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
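The abstract names the penalty terms that make up the AGL objective but not their exact functional forms. As a rough illustration of how such a composite regularizer could be assembled, the sketch below combines a diagonal-Fisher penalty on alignment-sensitive coordinates, a task-update regularizer, a coordinate-wise overlap term, and an angular (geodesic) separation term. All function names, penalty forms, and coefficients here are assumptions for illustration, not the paper's implementation.

```python
import math

def agl_penalty(delta_w, fisher_diag, delta_align, delta_task,
                lam_fisher=1.0, lam_task=0.1, lam_collision=0.5):
    """Illustrative sketch of AGL-style penalty terms (forms are assumed).

    delta_w     : flattened LoRA update
    fisher_diag : diagonal Fisher estimate; large entries mark
                  alignment-sensitive coordinates
    delta_align : update component in the alignment-sensitive subspace
    delta_task  : update component carrying the new task knowledge
    """
    # Fisher-guided term: quadratically penalize movement along
    # coordinates with high Fisher information (alignment-sensitive).
    fisher_reg = lam_fisher * sum(f * d * d for f, d in zip(fisher_diag, delta_w))

    # Task-specific term: keep the magnitude of the task update bounded.
    task_reg = lam_task * sum(d * d for d in delta_task)

    # "Riemannian overlap"-style term: coordinate-wise interference
    # between the alignment and task components.
    overlap = sum(abs(a * t) for a, t in zip(delta_align, delta_task))

    # "Geodesic separation"-style term: penalty shrinks as the angle
    # between the two update directions grows toward pi.
    dot = sum(a * t for a, t in zip(delta_align, delta_task))
    norm = (math.sqrt(sum(a * a for a in delta_align))
            * math.sqrt(sum(t * t for t in delta_task)) + 1e-12)
    angle = math.acos(max(-1.0, min(1.0, dot / norm)))
    geodesic = 1.0 - angle / math.pi

    collision_reg = lam_collision * (overlap + geodesic)
    return fisher_reg + task_reg + collision_reg
```

Under this toy formulation, orthogonal alignment and task updates incur a smaller collision penalty than parallel ones, which matches the abstract's stated goal of encouraging disjoint update geometry.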
PDF · August 6, 2025