Scaling MLPs: A Tale of Inductive Bias

June 23, 2023
Authors: Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
cs.AI

Abstract

In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized by transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large-scale pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both of these aspects. We show that the performance of MLPs improves drastically with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet), highlighting that a lack of inductive bias can indeed be compensated for. We observe that MLPs faithfully mimic the behaviour of their modern counterparts, although some components of the learning setting surprisingly exhibit stronger or unexpected behaviours. Due to their inherent computational efficiency, large-scale pre-training experiments become more accessible to academic researchers. All of our experiments were run on a single GPU.
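To make concrete what applying an MLP "completely free of any inductive bias" to a vision task involves, the minimal sketch below flattens CIFAR-10 images into vectors and passes them through a plain stack of fully connected layers, discarding all spatial structure. The width, depth, and GELU activation here are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch (illustrative, not the authors' exact architecture):
# an MLP that treats CIFAR-10 images as flat vectors, i.e. with no
# convolutional or attention-based inductive bias.
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, in_dim=3 * 32 * 32, width=1024, depth=6, num_classes=10):
        super().__init__()
        layers = [nn.Flatten()]  # discard all spatial structure up front
        dims = [in_dim] + [width] * depth
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.GELU()]
        layers.append(nn.Linear(width, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):   # x: (batch, 3, 32, 32)
        return self.net(x)  # logits: (batch, num_classes)


model = MLP()
logits = model(torch.randn(8, 3, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```

Scaling such a model simply means increasing the width and depth (and the amount of pre-training data); the paper's results suggest that this alone recovers much of the accuracy usually attributed to architectural inductive bias.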