
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

May 31, 2024
Authors: Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
cs.AI

Abstract

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they often suffer from poor meta-generalization, especially when training networks larger than those seen during meta-training. To address this, we use the recently proposed Maximal Update Parametrization (muP), which allows zero-shot generalization of optimizer hyperparameters from smaller to larger models. We extend muP theory to learned optimizers, treating the meta-training problem as finding the learned optimizer under muP. Our evaluation shows that LOs meta-trained with muP substantially improve meta-generalization as compared to LOs trained under standard parametrization (SP). Notably, when applied to large-width models, our best muLO, trained for 103 GPU-hours, matches or exceeds the performance of VeLO, the largest publicly available learned optimizer, meta-trained with 4000 TPU-months of compute. Moreover, muLOs demonstrate better generalization than their SP counterparts to deeper networks and to much longer training horizons (25 times longer) than those seen during meta-training.
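
The abstract relies on muP's key property: optimizer hyperparameters tuned on a narrow "base" model transfer zero-shot to wider models. The sketch below illustrates those width-scaling rules for a plain MLP trained with Adam. It is a simplified illustration only, not the authors' learned-optimizer (μLO) code; the names MuPMLP, base_width, and mup_param_groups are assumptions introduced for this example.

```python
# Minimal sketch of muP-style width scaling for an MLP trained with Adam.
# Illustrative simplification of Maximal Update Parametrization (muP); not the
# paper's implementation. Names like `base_width` are assumptions.
import math
import torch
import torch.nn as nn


class MuPMLP(nn.Module):
    def __init__(self, in_dim=32, width=256, out_dim=10, base_width=64):
        super().__init__()
        self.mult = width / base_width  # width multiplier m relative to the base model
        self.fc_in = nn.Linear(in_dim, width)
        self.fc_hidden = nn.Linear(width, width)
        self.fc_out = nn.Linear(width, out_dim)
        # muP-style init: hidden weights scale like 1/fan_in,
        # so the std shrinks as the network gets wider.
        nn.init.normal_(self.fc_hidden.weight, std=1.0 / math.sqrt(width))
        nn.init.zeros_(self.fc_out.weight)  # zero output init is a common muP choice

    def forward(self, x):
        h = torch.relu(self.fc_in(x))
        h = torch.relu(self.fc_hidden(h))
        # Output multiplier 1/m keeps logits O(1) as width grows.
        return self.fc_out(h) / self.mult


def mup_param_groups(model: MuPMLP, base_lr=1e-3):
    """Per-layer Adam learning rates under muP: hidden/output weights get
    base_lr / m, while input weights and biases keep the width-independent base_lr."""
    scaled, unscaled = [], []
    for name, p in model.named_parameters():
        if name.startswith(("fc_hidden.weight", "fc_out.weight")):
            scaled.append(p)
        else:
            unscaled.append(p)
    return [
        {"params": unscaled, "lr": base_lr},
        {"params": scaled, "lr": base_lr / model.mult},
    ]


# Usage: tune base_lr once at base_width, then reuse it unchanged at larger widths.
model = MuPMLP(width=1024, base_width=64)
optimizer = torch.optim.Adam(mup_param_groups(model, base_lr=1e-3))
```

The paper's contribution is to apply this same parametrization during meta-training of a learned optimizer, so that the learned update rule, like a hand-tuned learning rate, transfers to wider networks without retuning.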
