

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

May 31, 2024
Authors: Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
cs.AI

Abstract

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they often suffer from poor meta-generalization, especially when training networks larger than those seen during meta-training. To address this, we use the recently proposed Maximal Update Parametrization (muP), which allows zero-shot generalization of optimizer hyperparameters from smaller to larger models. We extend muP theory to learned optimizers, treating the meta-training problem as finding the learned optimizer under muP. Our evaluation shows that LOs meta-trained with muP substantially improve meta-generalization as compared to LOs trained under standard parametrization (SP). Notably, when applied to large-width models, our best muLO, trained for 103 GPU-hours, matches or exceeds the performance of VeLO, the largest publicly available learned optimizer, meta-trained with 4000 TPU-months of compute. Moreover, muLOs demonstrate better generalization than their SP counterparts to deeper networks and to much longer training horizons (25 times longer) than those seen during meta-training.
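As a rough illustration of the idea described in the abstract, the sketch below applies muP-style width scaling to the update proposed by a learned optimizer. The function names, the layer classification, and the exact scaling rules are illustrative assumptions (following the commonly cited muP prescription for Adam-like updates, where hidden- and output-weight steps shrink as 1/fan_in while input-like parameters are left unscaled); the paper's actual μLO parametrization may differ in detail.

```python
import numpy as np

def mup_scaled_update(param, raw_update, fan_in, layer_type):
    """Rescale a learned optimizer's proposed step with muP-style factors.

    Hypothetical sketch: `raw_update` stands in for whatever the learned
    optimizer's per-parameter network outputs; muP then rescales it with
    width so the same meta-trained optimizer transfers to wider models.
    """
    if layer_type in ("hidden", "output"):
        # muP prescription for Adam-like updates: scale the step by
        # 1/fan_in so feature-level changes stay O(1) as width grows.
        scale = 1.0 / fan_in
    else:
        # Input weights, biases, and norm parameters: no width scaling.
        scale = 1.0
    return param + scale * raw_update

# Toy usage on a wide hidden layer with a dummy learned update.
rng = np.random.default_rng(0)
fan_in = 1024
W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_in))  # muP init: variance 1/fan_in
raw_update = rng.normal(size=W.shape)  # stand-in for the LO's output
W_new = mup_scaled_update(W, raw_update, fan_in, "hidden")
```

Under this kind of parametrization, the hyperparameters (here, the learned optimizer's weights found during meta-training on small models) can be reused zero-shot on wider models, which is the meta-generalization property the paper evaluates.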
