Stepsize Anything: A Unified Learning Rate Schedule for Budgeted-Iteration Training

Abstract

Expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within a predetermined iteration budget. Learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, yet their design remains largely heuristic and lacks theoretical foundations. In addition, finding the best learning rate schedule requires extensive trial-and-error selection, which makes the training process inefficient. In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly used schedules across diverse architectures and tasks under different constrained training budgets. First, we bridge the gap by constructing a novel training budget-aware optimization framework that explicitly accounts for robustness to landscape curvature variations. From this framework, we derive the UBA schedule, which is controlled by a single hyper-parameter φ that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between φ and the condition number, adding interpretation and justification to our approach. We also prove convergence for different values of φ and offer practical guidelines for selecting φ based on theoretical analysis and empirical results. Extensive experimental results show that UBA consistently surpasses commonly used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.
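To make the budgeted-iteration setting concrete, the sketch below shows what it means for a schedule to depend on the iteration budget: the learning rate is a function of the progress ratio t/T, so the same curve is compressed or stretched to fit whatever budget T is given. It uses two commonly used baselines (cosine and linear decay) purely for illustration. It does not implement the UBA schedule, whose closed form is not stated in this abstract, and the function names and the `min_lr` parameter are assumptions introduced here.

```python
import math


def cosine_lr(step, budget, base_lr, min_lr=0.0):
    """Baseline cosine decay from base_lr to min_lr over `budget` iterations.

    Illustrative baseline only; this is NOT the UBA schedule.
    """
    # Progress through the budget, clamped to [0, 1].
    progress = min(step, budget) / budget
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_lr(step, budget, base_lr, min_lr=0.0):
    """Baseline linear decay from base_lr to min_lr over `budget` iterations."""
    progress = min(step, budget) / budget
    return base_lr + (min_lr - base_lr) * progress


def lr_trace(schedule, budget, base_lr):
    """Learning rate at every iteration 0..budget for a given schedule."""
    return [schedule(t, budget, base_lr) for t in range(budget + 1)]


if __name__ == "__main__":
    # The same schedule fitted to two different budgets: equal progress
    # ratios (25% here) yield the same learning rate.
    for budget in (100, 1000):
        step = budget // 4
        print(budget, step, cosine_lr(step, budget, base_lr=0.1))
```

Both baselines have a fixed shape once the budget is chosen. The schedule described in the abstract instead exposes a single hyper-parameter φ that controls the shape.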