The Hidden Power of the Scaling Factor in LoRA Optimization

Abstract

In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we show that α and the learning rate play distinct roles: α emerges as the dominant driver of effective optimization, delivering gains that learning-rate scaling alone cannot replicate. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when this smoothness is leveraged to accelerate convergence, α outperforms the learning rate because it amplifies the task signal without increasing the drift ratio. Third, the optimal scaling factor grows sublinearly with the rank and is well characterized by a square-root law with an unexpectedly large coefficient, which shows that existing rank-tied heuristics scale α insufficiently. Based on these findings, we propose LoRA-α, a minimalist framework that restores α to its principled regime and makes LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks show that LoRA-α consistently improves performance while simplifying hyperparameter search, unlocking the learning potential of LoRA.
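To make the role of α concrete, the following is a minimal sketch of where the scaling factor enters a LoRA layer. It assumes the conventional parameterization y = Wx + (α/r)·B(Ax) from the original LoRA formulation; the abstract does not state which normalization LoRA-α uses, so the α/r convention here is an assumption. The helper `sqrt_law_alpha` and its coefficient `c` are hypothetical illustrations of a square-root law of the form α = c·√r; the value of `c` is a placeholder, not the coefficient reported in the paper.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen base weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    The alpha / r normalization is the conventional LoRA choice (assumed here).
    """
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def sqrt_law_alpha(r, c):
    """Hypothetical square-root rule alpha = c * sqrt(r); c is a placeholder."""
    return c * math.sqrt(r)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank r = 1
B_zero = [[0.0], [0.0]]     # B starts at zero in standard LoRA initialization
B = [[1.0], [2.0]]
x = [3.0, 4.0]

# With B = 0 the adapter contributes nothing, whatever alpha is.
y_init = lora_forward(x, W, A, B_zero, alpha=16.0)

# Once B is nonzero, the adapter's contribution is linear in alpha.
y_a = lora_forward(x, W, A, B, alpha=1.0)
y_b = lora_forward(x, W, A, B, alpha=2.0)
```

Note that α multiplies the low-rank update in the forward pass, whereas the learning rate only sets the step size applied to A and B. This sketch illustrates that structural difference only; it does not reproduce the Signal-Drift analysis.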