

Revisiting the Shape Convention of Transformer Language Models

February 6, 2026
Authors: Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu
cs.AI

Abstract

Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results showing that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that the parameters saved by using a lighter hourglass FFN can be used more effectively, for example by enlarging the model's hidden dimension under a fixed budget. We confirm these claims empirically across model scales: hourglass FFNs outperform conventional FFNs at scales up to 400M parameters and achieve comparable performance at scales up to 1B; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and of the balance between attention and FFN, toward more efficient and expressive modern language models.
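The abstract does not give an implementation, but the core idea (a stack of wide-narrow-wide sub-MLPs joined by residual paths, replacing one narrow-wide-narrow MLP) can be sketched minimally. The dimensions, depth, and activation below are hypothetical choices for illustration, not the paper's actual configuration:

```python
import numpy as np

# Hypothetical dimensions for illustration; the paper does not specify these.
D_MODEL = 64      # Transformer hidden size
BOTTLENECK = 16   # narrow middle width of each hourglass sub-MLP
N_SUB = 3         # number of stacked sub-MLPs forming the FFN

class HourglassFFN:
    """Stack of wide-narrow-wide (hourglass) sub-MLPs connected by residuals."""
    def __init__(self, d_model, bottleneck, n_sub, seed=0):
        rng = np.random.default_rng(seed)
        self.blocks = []
        for _ in range(n_sub):
            w_down = rng.standard_normal((d_model, bottleneck)) / np.sqrt(d_model)
            w_up = rng.standard_normal((bottleneck, d_model)) / np.sqrt(bottleneck)
            self.blocks.append((w_down, w_up))

    def n_params(self):
        return sum(w_down.size + w_up.size for w_down, w_up in self.blocks)

    def __call__(self, x):
        # Each sub-MLP contracts to the bottleneck, expands back to d_model,
        # and is added to its input: the FFN becomes deeper but lighter.
        for w_down, w_up in self.blocks:
            x = x + np.maximum(x @ w_down, 0.0) @ w_up  # ReLU in the bottleneck
        return x

ffn = HourglassFFN(D_MODEL, BOTTLENECK, N_SUB)
y = ffn(np.zeros((2, 5, D_MODEL)))          # (batch, seq, d_model) in, same shape out
conventional = 2 * D_MODEL * (4 * D_MODEL)  # one narrow-wide-narrow MLP at expansion 4
print(y.shape, ffn.n_params(), conventional)
```

With these toy numbers the stacked hourglass FFN uses 6,144 weights versus 32,768 for a conventional expansion-4 MLP at the same hidden size, illustrating the paper's point that the savings could instead fund a larger hidden dimension or more attention parameters.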