AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
August 4, 2025
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
cs.AI
Abstract
Low-rank adaptation (LoRA) has become a standard tool for efficiently
fine-tuning large language models (LLMs). Yet, even minor LoRA updates can
induce alignment drift, weakening safety and behavioral constraints through
entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL),
a principled framework for preserving alignment during fine-tuning. AGL
introduces several key components: a primary task loss for supervision, Fisher
Information Matrix-based regularization to restrict updates in
alignment-sensitive subspaces, and task-specific regularization to stabilize
the integration of new knowledge. We further introduce collision-aware
regularization, blending Riemannian overlap -- which penalizes coordinate-wise
interference -- and geodesic separation -- which encourages disjoint update
geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and
unsafe prompts designed to quantify alignment drift and safety degradation.
Empirical evaluations show that AGL mitigates alignment drift by up to 50% on
safety-critical benchmarks without degrading downstream task performance.
Comprehensive ablation confirms that each component contributes distinctly to
preserving latent safety behaviors. Finally, we derive and validate a scaling
law for catastrophic forgetting, revealing that AGL flattens post-finetuning
loss escalation while preserving adaptation dynamics. AGL is a structurally
grounded refinement of LoRA, ensuring alignment preservation with minimal
trade-offs. To encourage further exploration and development, we open-source
our implementation.
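The abstract describes a composite objective: a task loss plus a Fisher-weighted penalty on alignment-sensitive directions, a task-specific regularizer, and collision-aware terms (Riemannian overlap and geodesic separation). The sketch below is a minimal, hypothetical NumPy rendering of that structure, not the authors' implementation: the function name, the split of the LoRA update into `delta_w_align` / `delta_w_task`, the diagonal Fisher estimate, and all coefficient names are illustrative assumptions.

```python
import numpy as np

def agl_loss_sketch(task_loss, delta_w_align, delta_w_task, fisher_diag,
                    lam_f=1.0, lam_t=0.1, lam_c=0.5):
    """Hypothetical AGL-style composite objective (illustrative only).

    delta_w_align / delta_w_task: the LoRA update split into an
    alignment-sensitive component and a task component (assumed to come
    from a Fisher-guided decomposition). fisher_diag: a diagonal Fisher
    Information estimate over the alignment-sensitive directions.
    """
    # Fisher-weighted penalty: discourage movement along
    # alignment-sensitive (high-Fisher) coordinates.
    fisher_pen = np.sum(fisher_diag * delta_w_align ** 2)

    # Task-specific regularization: stabilize the magnitude of the
    # new-knowledge update.
    task_pen = np.sum(delta_w_task ** 2)

    # Collision-aware term, Riemannian-overlap flavor: penalize
    # coordinate-wise interference between the two components.
    overlap = np.sum(np.abs(delta_w_align * delta_w_task))

    # Geodesic-separation flavor: penalize small angles between the
    # flattened update directions (0 when antipodal, 1 when parallel).
    cos = np.dot(delta_w_align.ravel(), delta_w_task.ravel()) / (
        np.linalg.norm(delta_w_align) * np.linalg.norm(delta_w_task) + 1e-12)
    geodesic = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    return (task_loss + lam_f * fisher_pen + lam_t * task_pen
            + lam_c * (overlap + geodesic))
```

Under this sketch, an update whose task component is orthogonal to the alignment-sensitive component incurs a lower collision penalty than a parallel one, matching the abstract's goal of encouraging disjoint update geometry.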