The Hidden Power of the Scaling Factor in LoRA Optimization

Abstract

In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we show that the scaling factor α and the learning rate act differently, with α emerging as the dominant driver of effective optimization and delivering gains that learning-rate scaling alone cannot replicate. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when this smoothness is leveraged to accelerate convergence, α outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, which reveals that existing rank-tied heuristics scale insufficiently. Based on these insights, we propose LoRA-α, a minimalist framework that restores α to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-α consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
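As a rough illustration of where the scaling factor enters, the sketch below computes a LoRA-style forward pass, y = W x + s · B (A x), in pure Python. It is a minimal sketch under stated assumptions, not the LoRA-α method itself: the abstract does not say whether its α denotes the multiplier s directly or a numerator as in the common conventions s = α/r and s = α/√r, so the code simply exposes s as `scale`. The helper `sqrt_law_alpha` and its `coeff` argument are hypothetical names for a square-root rule of the form α = c·√r; the abstract gives no numeric value for the coefficient.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale):
    """Return W x + scale * B (A x) for a frozen W and low-rank factors A, B."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

def sqrt_law_alpha(rank, coeff):
    """Hypothetical square-root rule: alpha = coeff * sqrt(rank).

    `coeff` is a placeholder; the abstract states no numeric value for it.
    """
    return coeff * math.sqrt(rank)

# Toy example: d_out = d_in = 2, rank r = 1 (all values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # r x d_in
B = [[0.5], [0.0]]             # d_out x r
x = [2.0, 3.0]

y_small = lora_forward(W, A, B, x, scale=1.0)
y_large = lora_forward(W, A, B, x, scale=2.0)
```

In this toy setup, doubling `scale` doubles only the low-rank contribution B (A x) and leaves the frozen term W x unchanged, which is the sense in which the scaling factor amplifies the adapter's signal in the forward pass.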