LoRA最適化におけるスケーリング係数の隠れた力

要旨

低ランク適応（Low-Rank Adaptation, LoRA）において、スケーリング係数αはしばしば学習率の単なる補完として扱われるが、その最適化における役割は未だ十分に理解されていない。本論文では、スケーリング係数αと学習率が異なる機能を持つことを明らかにし、αが効果的最適化の主要な推進力として現れ、学習率のスケーリングだけでは再現できない利得をもたらすことを示す。広範な経験的分析と理論的な信号-ドリフト枠組みの相乗効果を通じて、我々はLoRAのスケーリングメカニズムに関する3つの発見を明らかにする。第一に、LoRAのスペクトル抑制は最適化地形を平滑化し、標準的なハイパーパラメータを過度に保守的にし、最適化ギャップを生み出す。第二に、この平滑性を活用して収束を加速する場合、αはタスク信号を増幅し、ドリフト比を増加させることなく学習率を上回る性能を発揮する。第三に、最適なスケーリング係数はランクと劣線形関係にあり、予想外に大きな係数を持つ平方根則によってよく特徴づけられ、既存のランク連動ヒューリスティックスのスケーリングが不十分であることを明らかにする。これらの知見に基づき、我々はLoRA-αを提案する。これはαを原理的な領域に戻す最小限の枠組みであり、LoRAを標準的な小さな学習率と互換性を持たせる。多様なタスクにわたる広範な評価により、LoRA-αがハイパーパラメータ探索を効率化しつつ一貫して性能を向上させ、LoRAの学習可能性を解放することを実証する。

English

In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor α and the learning rate function differently, with α emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings into LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, α outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-α, a minimalist framework that restores α to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-α consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.