Scaling MLPs: A Tale of Inductive Bias

Abstract

In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized by transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects. We show that the performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet), highlighting that the lack of inductive bias can indeed be compensated for. We observe that MLPs faithfully mimic the behaviour of their modern counterparts, although some components of the learning setting exhibit surprisingly stronger or unexpected behaviours. Due to the inherent computational efficiency of MLPs, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
