

The Impact of Depth and Width on Transformer Language Model Generalization

October 30, 2023
Authors: Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, Tal Linzen
cs.AI

Abstract

To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by recent theoretical and empirical work, that transformers generalize more compositionally when they are deeper (have more layers). Because simply adding layers increases the total number of parameters, confounding depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data.
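The central design choice described above is holding the total parameter count fixed while trading depth for width. As a rough, hedged illustration (not the authors' actual model configurations), the sketch below assumes a standard decoder-only transformer in which each layer contributes roughly 12·d_model² non-embedding parameters (attention projections plus a 4×-wide feed-forward block), and solves for the width that keeps a family of models near a given budget, such as the paper's 134M-parameter class.

```python
# Minimal sketch of a depth-for-width trade-off at constant parameter count.
# Assumptions (not from the paper): each transformer layer contributes about
# 12 * d_model^2 parameters (4*d^2 for Q/K/V/output projections plus 8*d^2
# for a feed-forward block with hidden size 4*d), ignoring embeddings,
# biases, and layer norms.

def approx_layer_params(d_model: int) -> int:
    # Approximate non-embedding parameters contributed by one layer.
    return 12 * d_model * d_model

def width_for_budget(n_layers: int, param_budget: int) -> int:
    # Solve 12 * n_layers * d_model^2 ~= param_budget for d_model.
    return round((param_budget / (12 * n_layers)) ** 0.5)

if __name__ == "__main__":
    budget = 134_000_000  # e.g. the 134M-parameter family mentioned in the abstract
    for depth in (2, 4, 8, 16, 32):
        width = width_for_budget(depth, budget)
        total = depth * approx_layer_params(width)
        print(f"depth={depth:2d}  d_model≈{width:5d}  "
              f"non-embedding params≈{total / 1e6:.0f}M")
```

Running the sketch shows the qualitative pattern the experiments rely on: doubling the number of layers requires shrinking d_model by roughly a factor of √2 to stay within the same budget, which is how deeper and shallower models within a family remain comparable in size.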