Variable-Width Transformers
June 16, 2026
Authors: Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim
cs.AI
Abstract
Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures keep a constant width across all layers, spreading a fixed parameter and compute budget evenly even though different layers may play distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a ×-shaped ><former architecture. This design keeps the early and late layers wide while narrowing the middle layers, using a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and lower KV cache memory and I/O cost (15% reduction). In our analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can enable more resource-optimal scaling of language models.
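To make the shape described in the abstract concrete, the following is a minimal sketch of a wide-narrow-wide width schedule together with one possible parameter-free way to resize a residual vector between layers of different width. The abstract does not specify either detail, so everything here is an assumption made for illustration: the piecewise-linear schedule, the function names `width_schedule` and `resize_residual`, and the choice of truncation and zero-padding as the resizing operation are all hypothetical and may differ from the authors' method.

```python
def width_schedule(num_layers, d_outer, d_middle):
    """Hypothetical schedule: widest at the first and last layer,
    narrowest at the center, linearly interpolated in between."""
    if num_layers == 1:
        return [d_outer]
    center = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # t is 1.0 at either end of the stack and 0.0 at the center.
        t = abs(i - center) / center
        widths.append(round(d_middle + t * (d_outer - d_middle)))
    return widths


def resize_residual(x, new_width):
    """Assumed parameter-free resize of one residual vector (a list of floats):
    truncate when the next layer is narrower, zero-pad when it is wider.
    This is an illustrative stand-in, not the mechanism from the paper."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


if __name__ == "__main__":
    widths = width_schedule(num_layers=5, d_outer=1024, d_middle=512)
    print("per-layer widths:", widths)
    print("average width:", sum(widths) / len(widths))

    # Carry a toy residual vector through a shrink and then a grow step.
    x = [1.0, 2.0, 3.0, 4.0]
    narrow = resize_residual(x, 2)
    wide = resize_residual(narrow, 4)
    print("narrowed:", narrow, "widened:", wide)
```

Neither function introduces learned weights, which is the only property of the resizing step that the abstract states. The average width printed by the sketch only illustrates why narrowing the middle layers lowers the mean layer width; it is not a reproduction of the reported FLOPs or KV cache savings, which are measured against parameter-matched baselines.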