

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

April 2, 2024
Authors: David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro
cs.AI

Abstract

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
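
For intuition, here is a minimal, hypothetical PyTorch sketch of the kind of per-layer top-k routing the abstract describes; it is not the authors' implementation. The names `MoDBlock` and `capacity_frac` are illustrative assumptions, the causal attention mask is omitted for brevity, and weighting the block's output by the router score is one simple way to give the router a gradient path.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """One transformer block with Mixture-of-Depths-style top-k token routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity_frac: float = 0.125):
        super().__init__()
        self.capacity_frac = capacity_frac           # fraction of tokens processed by this layer
        self.router = nn.Linear(d_model, 1)          # per-token scalar routing score
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity_frac * T))      # k is fixed a priori -> static tensor sizes
        scores = self.router(x).squeeze(-1)          # (B, T) routing scores
        top = scores.topk(k, dim=-1).indices         # which k tokens participate in this layer
        idx = top.unsqueeze(-1).expand(-1, -1, D)
        sel = x.gather(1, idx)                       # (B, k, D) only the selected tokens

        # Standard block applied to the k routed tokens (causal masking omitted for brevity).
        h = self.norm1(sel)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        delta = attn_out + self.mlp(self.norm2(sel + attn_out))

        # Weight the block's contribution by the router score so the router receives gradients,
        # then scatter it back; non-selected tokens skip the layer and pass through the
        # residual stream unchanged.
        gate = scores.gather(1, top).unsqueeze(-1)   # (B, k, 1)
        return x + torch.zeros_like(x).scatter(1, idx, gate * delta)

# Usage: with capacity_frac=0.25 and sequence length 16, only 4 tokens per layer
# go through self-attention and the MLP; the rest are routed around the block.
block = MoDBlock(d_model=64, n_heads=4, capacity_frac=0.25)
x = torch.randn(2, 16, 64)
print(block(x).shape)                                # torch.Size([2, 16, 64])
```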
