Variable-Width Transformers

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures keep a constant width across all layers, spreading a fixed parameter and computation budget evenly even though different layers may play distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing an hourglass-shaped (×-shaped) > <former architecture. This design keeps the early and late layers wide while narrowing the middle layers, and it uses a parameter-free residual resizing mechanism to connect layers of different widths. Across decoder-only language models ranging from 200M to 2B parameters (dense) and at 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (a 22% reduction under fitted loss-matched scaling curves) and incurs lower KV cache memory and I/O cost (a 15% reduction). In our analysis, we show that this bottleneck structure produces qualitatively different representations in the residual stream. Overall, our results demonstrate that nonuniform width allocation can yield more resource-efficient scaling of language models.
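
To make the shape concrete, the following is a minimal sketch of a wide-narrow-wide width schedule together with one possible parameter-free resizing step. The abstract does not specify the schedule or the resizing rule, so everything here is an assumption for illustration: the linear taper, the truncate/zero-pad rule, and the names `hourglass_widths` and `resize_residual` are hypothetical and are not taken from the paper.

```python
def hourglass_widths(num_layers, d_outer, d_middle):
    """Per-layer widths: wide at both ends, narrow at mid-depth.

    Assumption: a linear taper; the paper's actual schedule may differ.
    """
    if num_layers < 2:
        return [d_outer] * num_layers
    half = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        t = abs(i - half) / half  # 1.0 at the first/last layer, 0.0 at the centre
        widths.append(round(d_middle + t * (d_outer - d_middle)))
    return widths


def resize_residual(x, new_width):
    """Resize a residual vector without learned parameters.

    Assumption: truncate to narrow, zero-pad to widen. This is only one
    parameter-free choice, not necessarily the mechanism used in the paper.
    """
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = hourglass_widths(num_layers=5, d_outer=8, d_middle=4)
average_width = sum(widths) / len(widths)  # lower than the uniform width of 8
narrowed = resize_residual([1.0, 2.0, 3.0], 2)
widened = resize_residual([1.0, 2.0, 3.0], 5)
```

In this toy setting the average width is below the outer width, which is the property the abstract links to lower FLOPs and a smaller KV cache; the actual savings depend on how per-layer cost scales with width and are not reproduced by this sketch.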