The Impact of Depth and Width on Transformer Language Model Generalization

Abstract

To process novel sentences, language models (LMs) must generalize compositionally, that is, combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by recent theoretical and empirical work, that transformers generalize more compositionally when they are deeper (have more layers). Because simply adding layers increases the total number of parameters, confounding depth and size, we construct three classes of models that trade off depth for width so that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize better out-of-distribution than shallower models do, but the relative benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling or on in-distribution data.
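The depth-for-width trade described above can be illustrated with a rough parameter count. The following is a minimal sketch, not the procedure used in this work. It assumes that "width" is adjusted through the feed-forward dimension while the model dimension stays fixed, that each block holds roughly 4 * d_model^2 attention weights plus 2 * d_model * d_ff feed-forward weights, and that embeddings, biases and layer norms are ignored. The function names, the budget and d_model = 512 are hypothetical values chosen for illustration.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weight count of one transformer block (biases, layer norms ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # up-projection and down-projection
    return attention + feed_forward


def total_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Approximate non-embedding weight count of the whole stack."""
    return n_layers * layer_params(d_model, d_ff)


def d_ff_for_budget(budget: int, n_layers: int, d_model: int) -> int:
    """Largest feed-forward width that keeps n_layers blocks within the budget."""
    per_layer = budget // n_layers
    remaining = per_layer - 4 * d_model * d_model  # what is left after attention
    if remaining <= 0:
        raise ValueError("budget too small for this depth and d_model")
    return remaining // (2 * d_model)


if __name__ == "__main__":
    budget = 30_000_000   # hypothetical budget, not a figure from the paper
    d_model = 512         # hypothetical model dimension
    for n_layers in (1, 2, 4, 8, 16):
        d_ff = d_ff_for_budget(budget, n_layers, d_model)
        print(n_layers, d_ff, total_params(n_layers, d_model, d_ff))
```

Under these assumptions, doubling the depth roughly halves the feed-forward width available to each layer, and beyond some depth the attention weights alone exhaust the budget, which bounds how deep a model of a given size can be.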
