Just How Flexible are Neural Networks in Practice?
June 17, 2024
Authors: Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson
cs.AI
Abstract
It is widely believed that a neural network can fit a training set containing
at least as many samples as it has parameters, underpinning notions of
overparameterized and underparameterized models. In practice, however, we only
find solutions accessible via our training procedure, including the optimizer
and regularizers, limiting flexibility. Moreover, the exact parameterization of
the function class, built into an architecture, shapes its loss surface and
impacts the minima we find. In this work, we examine the ability of neural
networks to fit data in practice. Our findings indicate that: (1) standard
optimizers find minima where the model can only fit training sets with
significantly fewer samples than it has parameters; (2) convolutional networks
are more parameter-efficient than MLPs and ViTs, even on randomly labeled data;
(3) while stochastic training is thought to have a regularizing effect, SGD
actually finds minima that fit more training data than full-batch gradient
descent; (4) the difference in capacity to fit correctly labeled and
incorrectly labeled samples can be predictive of generalization; (5) ReLU
activation functions result in finding minima that fit more data despite being
designed to avoid vanishing and exploding gradients in deep architectures.
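
The capacity probe behind these findings can be illustrated with a short sketch: train a small network on randomly labeled data with a standard optimizer and record the fraction of the training set it manages to fit, then compare that against the parameter count. The architecture, dataset sizes, and hyperparameters below are illustrative assumptions, not the authors' experimental setup.

```python
# Minimal sketch (assumed setup, not the paper's code): measure how much of a
# randomly labeled training set a small MLP can fit with a standard optimizer.
import torch
import torch.nn as nn


def fraction_fit(n_samples, dim=32, n_classes=10, epochs=500, seed=0):
    torch.manual_seed(seed)
    x = torch.randn(n_samples, dim)                # random inputs
    y = torch.randint(0, n_classes, (n_samples,))  # random labels (no structure to learn)
    model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        # Fraction of the randomly labeled training set the found minimum fits.
        return (model(x).argmax(dim=1) == y).float().mean().item()


if __name__ == "__main__":
    # Sweep training-set size; the model above has roughly 2.8k parameters,
    # so the interesting question is how far below (or above) that count the
    # largest perfectly fit set lies.
    n_params = sum(p.numel() for p in nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).parameters())
    for n in [100, 500, 1000, 5000]:
        print(f"n={n}, params={n_params}, fit fraction={fraction_fit(n):.3f}")
```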