Tapered Language Models

June 22, 2026
Authors: Reza Bayat, Ali Behrouz, Aaron Courville
cs.AI

Abstract

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
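
To make the central idea concrete, the sketch below computes per-layer MLP hidden widths that shrink with depth along a half-cosine curve while keeping the summed width equal to that of a uniform stack. For a fixed model dimension, MLP parameter count grows linearly with hidden width, so preserving the sum preserves the MLP parameter budget. The abstract does not specify the exact schedule, so everything here is an assumption made for illustration: the function name `cosine_tapered_widths`, the `first_to_last_ratio` parameter, and the largest-remainder rounding are hypothetical and may differ from what the authors actually use.

```python
import math


def cosine_tapered_widths(num_layers, uniform_width, first_to_last_ratio=3.0):
    """Per-layer MLP widths that decay with depth along a half-cosine.

    Hypothetical helper (not from the paper): the first layer is roughly
    `first_to_last_ratio` times wider than the last, and the widths sum to
    num_layers * uniform_width, i.e. the budget of a uniform stack.
    """
    if num_layers < 1 or uniform_width < 1 or first_to_last_ratio < 1.0:
        raise ValueError("need num_layers >= 1, uniform_width >= 1, ratio >= 1")
    if num_layers == 1:
        return [uniform_width]

    total = num_layers * uniform_width

    # Half-cosine shape: first_to_last_ratio at layer 0, 1.0 at the last layer.
    shape = []
    for i in range(num_layers):
        s = 0.5 * (1.0 + math.cos(math.pi * i / (num_layers - 1)))
        shape.append(1.0 + (first_to_last_ratio - 1.0) * s)

    # Rescale so the real-valued widths sum to the fixed budget.
    scale = total / sum(shape)
    raw = [scale * s for s in shape]

    # Round down, then hand the leftover units to the layers with the largest
    # fractional parts (ties go to earlier layers), so the total is exact.
    widths = [math.floor(r) for r in raw]
    leftover = total - sum(widths)
    order = sorted(range(num_layers), key=lambda i: -(raw[i] - widths[i]))
    for i in order[:leftover]:
        widths[i] += 1
    return widths


if __name__ == "__main__":
    widths = cosine_tapered_widths(num_layers=12, uniform_width=1024)
    print(widths)
    print(sum(widths) == 12 * 1024)
```

Setting `first_to_last_ratio=1.0` recovers the uniform baseline, which makes it easy to compare a tapered and an untapered model under the same budget. In practice one would probably also round widths to a hardware-friendly multiple, which this sketch omits.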