
Just How Flexible are Neural Networks in Practice?

June 17, 2024
Authors: Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson
cs.AI

Abstract

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
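The paper's central question is empirical: how large a training set can a given network actually drive to zero training error under a realistic training procedure? Below is a minimal sketch of that kind of probe, assuming a small PyTorch MLP trained with full-batch SGD on synthetic, randomly labeled data; the architecture, data, and hyperparameters are illustrative assumptions, not the authors' experimental setup.

```python
# Minimal sketch (not the authors' code): estimate how many randomly labeled
# samples a small network can fit exactly under a fixed training procedure.
import torch
import torch.nn as nn

def can_fit(n_samples, dim=32, n_classes=10, epochs=500, lr=1e-2, seed=0):
    """Train a small MLP on n_samples random-label points with full-batch SGD;
    return True if it reaches 100% training accuracy."""
    torch.manual_seed(seed)
    x = torch.randn(n_samples, dim)
    y = torch.randint(0, n_classes, (n_samples,))
    model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
    return acc == 1.0

if __name__ == "__main__":
    n_params = sum(p.numel() for p in
                   nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).parameters())
    largest_fit = 0
    # Sweep training-set sizes; stop at the first size the model cannot fit exactly.
    for n in [64, 128, 256, 512, 1024, 2048]:
        if can_fit(n):
            largest_fit = n
        else:
            break
    print(f"parameters: {n_params}, largest random-label set fit exactly: {largest_fit}")
```

Swapping the full-batch loop for mini-batch SGD, or the MLP for a convolutional network or ViT on image data, would mirror the comparisons behind findings (2) and (3); the linear sweep over sizes could also be replaced with a binary search for efficiency.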