The Impact of Depth and Width on Transformer Language Model Generalization

Abstract

To process novel sentences, language models (LMs) must generalize compositionally, that is, combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by recent theoretical and empirical work, that transformers generalize more compositionally when they are deeper (have more layers). Because simply adding layers increases the total number of parameters, confounding depth and size, we construct three classes of models that trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data.
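
The depth-for-width trade-off at a fixed parameter budget can be illustrated with a rough sketch. This is not the authors' construction: it assumes that "width" means the feed-forward hidden size (d_ff), that the model dimension (d_model) is held fixed, and that embeddings, biases and layer norms are ignored. The function names, the 40M budget and d_model = 512 are hypothetical values chosen only for illustration.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Approximate weight count of one transformer block.

    Assumption: biases and layer norms are ignored.
    """
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff  # up-projection and down-projection
    return attention + feed_forward


def d_ff_for_budget(budget: int, n_layers: int, d_model: int) -> int:
    """Largest d_ff such that n_layers blocks stay within the budget."""
    per_layer = budget // n_layers
    remaining = per_layer - 4 * d_model * d_model
    if remaining <= 0:
        raise ValueError("budget too small for this depth and d_model")
    return remaining // (2 * d_model)


if __name__ == "__main__":
    budget = 40_000_000  # hypothetical non-embedding parameter budget
    d_model = 512        # hypothetical fixed model dimension
    for n_layers in (1, 2, 4, 8, 16):
        d_ff = d_ff_for_budget(budget, n_layers, d_model)
        total = n_layers * block_params(d_model, d_ff)
        print(f"layers={n_layers:2d}  d_ff={d_ff:6d}  params={total}")
```

Under these assumptions, each doubling of depth roughly halves the per-layer budget, so the feed-forward size shrinks quickly, and past some depth the attention weights alone exceed the per-layer budget. That bounds how deep a model in a given size class can be.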
