Revisiting the Shape Convention of Transformer Language Models

Abstract

Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results showing that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers and challenge the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that the parameters saved by using a lighter hourglass FFN can be put to more effective use, for example by enlarging the model's hidden dimension under a fixed budget. We confirm both hypotheses through empirical validation across model scales: hourglass FFNs outperform conventional FFNs at scales up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; hourglass FFN variants with reduced FFN parameters and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and of the balance between attention and FFN in efficient and expressive modern language models.
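As a rough illustration of the hourglass FFN described above, the following Python sketch compares weight counts for a conventional narrow-wide-narrow MLP against a stack of wide-narrow-wide sub-MLPs, and runs a toy forward pass with a residual connection around each sub-MLP. The abstract does not specify the bottleneck width, the number of sub-MLPs, the activation, or the normalization, so the names `d_narrow` and `n_sub`, the GELU activation, and the omission of biases, normalization, and gating are all assumptions made for illustration only.

```python
import math
import random


def conventional_ffn_params(d_model: int, expansion: float) -> int:
    """Weights in a narrow-wide-narrow MLP: d_model -> expansion * d_model -> d_model.

    Biases and gating are ignored (an assumption of this sketch).
    """
    d_wide = int(expansion * d_model)
    return 2 * d_model * d_wide


def hourglass_ffn_params(d_model: int, d_narrow: int, n_sub: int) -> int:
    """Weights in n_sub stacked wide-narrow-wide sub-MLPs: d_model -> d_narrow -> d_model."""
    return n_sub * 2 * d_model * d_narrow


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gelu(v: float) -> float:
    """Tanh approximation of GELU; the choice of activation is an assumption."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init(rows: int, cols: int, rng: random.Random):
    """Random Gaussian weights scaled by 1 / sqrt(fan_in)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def hourglass_ffn(x, layers):
    """Apply each sub-MLP with a residual path: x <- x + W_up @ gelu(W_down @ x)."""
    for w_down, w_up in layers:
        h = [gelu(v) for v in matvec(w_down, x)]             # project down to d_narrow
        x = [xi + ui for xi, ui in zip(x, matvec(w_up, h))]  # project up, add residual
    return x


# Toy forward pass with hypothetical sizes.
rng = random.Random(0)
d_model, d_narrow, n_sub = 8, 2, 3
layers = [(init(d_narrow, d_model, rng), init(d_model, d_narrow, rng)) for _ in range(n_sub)]
y = hourglass_ffn([1.0] * d_model, layers)

# Weight counts for hypothetical configurations at d_model = 1024.
print(conventional_ffn_params(1024, 4))    # 2 * 1024 * 4096 weights
print(hourglass_ffn_params(1024, 256, 4))  # 4 * 2 * 1024 * 256 weights
```

In this hypothetical configuration the hourglass stack is four sub-MLPs deep yet uses a quarter of the weights of the conventional FFN, which is the kind of saving the abstract proposes to reinvest, for example in a larger hidden dimension.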
