

AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization

August 4, 2025
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
cs.AI

Abstract

Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during fine-tuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap -- which penalizes coordinate-wise interference -- and geodesic separation -- which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablations confirm that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-fine-tuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
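The abstract describes an objective combining a primary task loss with Fisher-guided, task-specific, and collision-aware penalties. The paper's exact formulas are not given here, so the following is only a toy sketch of one plausible shape of such an objective, assuming a diagonal Fisher estimate, a split of the LoRA update into alignment-sensitive and task-specific components, and assumed forms for the Riemannian-overlap and geodesic-separation terms (all names and coefficients are hypothetical):

```python
import numpy as np

def collision_penalty(delta_align, delta_task, lam_overlap=0.05, lam_geo=0.05):
    """Assumed collision-aware terms for two flattened update components."""
    # "Riemannian overlap" (assumed form): coordinate-wise interference
    # between the two update directions.
    overlap = np.sum(np.abs(delta_align * delta_task))
    # "Geodesic separation" (assumed form): penalize angular closeness,
    # encouraging near-orthogonal (disjoint) update geometry.
    cos = np.dot(delta_align, delta_task) / (
        np.linalg.norm(delta_align) * np.linalg.norm(delta_task) + 1e-12)
    geo = cos ** 2
    return lam_overlap * overlap + lam_geo * geo

def agl_loss(task_loss, delta_align, delta_task, fisher_diag,
             lam_fisher=0.1, lam_task=0.01):
    """Toy AGL-style objective: task loss plus the three penalty families."""
    # Fisher-guided penalty: discourage movement along alignment-sensitive
    # directions (coordinates with large Fisher information).
    fisher_pen = np.sum(fisher_diag * delta_align ** 2)
    # Task-specific ridge term stabilizing integration of new knowledge.
    stab_pen = np.sum(delta_task ** 2)
    return (task_loss
            + lam_fisher * fisher_pen
            + lam_task * stab_pen
            + collision_penalty(delta_align, delta_task))
```

With orthogonal toy updates `[1, 0]` and `[0, 1]`, the collision terms vanish and only the Fisher and ridge penalties contribute, illustrating how disjoint update geometry avoids the collision cost.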