

The Impact of Depth and Width on Transformer Language Model Generalization

October 30, 2023
Authors: Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, Tal Linzen
cs.AI

Abstract

To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by recent theoretical and empirical work, that transformers generalize more compositionally when they are deeper (have more layers). Because simply adding layers increases the total number of parameters, confounding depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data.
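The abstract's central design choice is to vary depth while holding total parameter count fixed, so that depth is not confounded with model size. Below is a minimal sketch, not the authors' sizing code, of how one might pick a hidden width for each depth under a fixed parameter budget. The function name `width_for_depth`, the specific depths, the 32k vocabulary, and the standard "~12 · layers · d_model²" decoder-block estimate are all assumptions for illustration; the paper's exact sizing procedure may differ.

```python
# Hypothetical sketch of a constant-parameter depth/width trade-off.
# Assumes ~12 * d_model^2 parameters per transformer block
# (4*d^2 for attention + 8*d^2 for a feed-forward layer with 4x expansion)
# plus a vocabulary embedding matrix; not the paper's actual configuration code.
import math


def width_for_depth(n_layers: int, param_budget: float, vocab_size: int = 32_000) -> int:
    """Return a d_model such that embeddings + n_layers blocks ~= param_budget."""
    # Solve 12 * n_layers * d^2 + vocab_size * d = param_budget for d > 0
    # via the quadratic formula.
    a = 12 * n_layers
    b = vocab_size
    c = -param_budget
    d_model = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return max(1, round(d_model))


if __name__ == "__main__":
    budget = 41e6  # the smallest model class in the paper (41M parameters)
    for depth in (2, 6, 12, 24):  # hypothetical depths, for illustration only
        d = width_for_depth(depth, budget)
        total = 12 * depth * d * d + 32_000 * d
        print(f"depth={depth:2d}  d_model={d:4d}  ~params={total / 1e6:.1f}M")
```

Under this kind of scheme, deeper members of a family get narrower hidden layers, which is what lets the study attribute out-of-distribution gains to depth itself rather than to raw parameter count.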