Variable-Width Transformers

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures keep a constant width across all layers, spreading a fixed parameter and compute budget evenly even though different layers may play distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a ×-shaped ><former architecture. This design keeps the early and late layers wide while narrowing the middle layers, and uses a parameter-free residual resizing mechanism to connect layers of different widths. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, the architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and lowers KV cache memory and I/O cost (15% reduction). In our analysis, we show that this bottleneck structure produces qualitatively different representations in the residual stream. Overall, our results demonstrate that nonuniform width allocation can yield more resource-optimal scaling of language models.
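
The abstract does not specify the width schedule or the resizing rule, so the following is only an illustrative sketch of the idea of wide outer layers, narrow middle layers, and a resize step that adds no parameters. The function names `x_shaped_widths` and `resize_residual`, the linear taper, the rounding to a multiple of 64, and the truncate/zero-pad rule are all assumptions made for illustration, not the authors' method.

```python
def x_shaped_widths(num_layers, d_outer, d_mid, multiple=64):
    """Hypothetical schedule: widest at the first and last layer,
    narrowest at the center, with a linear taper in between."""
    if num_layers < 2:
        return [d_outer] * num_layers
    center = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # t is 1.0 at the first/last layer and 0.0 at the center.
        t = abs(i - center) / center
        w = d_mid + t * (d_outer - d_mid)
        # Round to a hardware-friendly multiple (an assumption).
        widths.append(int(round(w / multiple)) * multiple)
    return widths


def resize_residual(x, new_width):
    """Assumed parameter-free resize of one residual vector (a list of
    floats): truncate when narrowing, zero-pad when widening."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = x_shaped_widths(num_layers=5, d_outer=1024, d_mid=512)
avg_width = sum(widths) / len(widths)  # lower than the uniform width of 1024
```

Under these assumptions, the average width falls below the outer width, which is the mechanism the abstract credits for the lower FLOPs and KV cache cost. The actual savings depend on the real schedule and resizing rule used in the paper.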