LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters

Abstract

A good initialization is essential for deep learning models because it helps them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes predicting good initial parameters increasingly valuable. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting the parameters of very wide networks relies on copying small chunks of parameters multiple times and requires an extremely large number of parameters to support full prediction, which greatly hinders adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that scales to significantly wider networks without requiring as excessive an increase in parameters as previous attempts. LoGAH allows us to predict the parameters of neural networks with 774 million parameters in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or with existing hypernetworks. Furthermore, we show promising transfer learning results from training LoGAH on small datasets and using the predicted parameters to initialize models for larger tasks. Code is available at https://github.com/Blackzxy/LoGAH .
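To make the idea of a low-rank parameter decoder concrete, here is a minimal sketch, not the LoGAH implementation. It assumes a decoder that maps a node embedding of size d to two small factors A (out_dim x r) and B (r x in_dim) and forms the predicted weight as their product, and it compares the decoder size with a hypothetical dense decoder that emits every weight entry directly. All function names, shapes, and the rank r are illustrative assumptions.

```python
import random

def full_decoder_params(d, out_dim, in_dim):
    """Size of a (hypothetical) linear decoder that maps a d-dim embedding
    directly to every entry of an out_dim x in_dim weight matrix."""
    return d * out_dim * in_dim

def low_rank_decoder_params(d, out_dim, in_dim, r):
    """Size of a (hypothetical) linear decoder that emits only the factors
    A (out_dim x r) and B (r x in_dim) from the same embedding."""
    return d * r * (out_dim + in_dim)

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def decode_low_rank(embedding, proj_a, proj_b, out_dim, in_dim, r):
    """Predict W = A @ B, where A and B are linear functions of the embedding.
    proj_a has shape d x (out_dim * r); proj_b has shape d x (r * in_dim)."""
    flat_a = matmul([embedding], proj_a)[0]
    flat_b = matmul([embedding], proj_b)[0]
    A = [flat_a[i * r:(i + 1) * r] for i in range(out_dim)]
    B = [flat_b[i * in_dim:(i + 1) * in_dim] for i in range(r)]
    return matmul(A, B)

# Tiny demo with made-up sizes (assumptions, not values from the paper).
random.seed(0)
d, out_dim, in_dim, r = 4, 6, 5, 2
embedding = [random.gauss(0, 1) for _ in range(d)]
proj_a = [[random.gauss(0, 1) for _ in range(out_dim * r)] for _ in range(d)]
proj_b = [[random.gauss(0, 1) for _ in range(r * in_dim)] for _ in range(d)]
W = decode_low_rank(embedding, proj_a, proj_b, out_dim, in_dim, r)
print(len(W), len(W[0]))  # shape of the predicted weight matrix

# Decoder sizes for a wider (still illustrative) layer.
print(full_decoder_params(64, 1024, 1024))
print(low_rank_decoder_params(64, 1024, 1024, 32))
```

Under these assumptions the dense decoder grows with the product out_dim * in_dim, while the low-rank decoder grows with the sum out_dim + in_dim, which is why widening the target network is far cheaper in the second case. The predicted matrix has rank at most r, so the choice of r trades decoder size against expressiveness.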
