

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

April 2, 2024
Authors: David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro
cs.AI

Abstract

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
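
For intuition, here is a minimal, hypothetical PyTorch sketch of the kind of per-layer top-k routing the abstract describes; it is not the authors' implementation. The names `MoDBlock` and `capacity_frac` are illustrative assumptions, the causal attention mask is omitted for brevity, and weighting the block's output by the router score is one simple way to give the router a gradient path.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """One transformer block with Mixture-of-Depths-style top-k token routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity_frac: float = 0.125):
        super().__init__()
        self.capacity_frac = capacity_frac           # fraction of tokens processed by this layer
        self.router = nn.Linear(d_model, 1)          # per-token scalar routing score
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity_frac * T))      # k is fixed a priori -> static tensor sizes
        scores = self.router(x).squeeze(-1)          # (B, T) routing scores
        top = scores.topk(k, dim=-1).indices         # which k tokens participate in this layer
        idx = top.unsqueeze(-1).expand(-1, -1, D)
        sel = x.gather(1, idx)                       # (B, k, D) only the selected tokens

        # Standard block applied to the k routed tokens (causal masking omitted for brevity).
        h = self.norm1(sel)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        delta = attn_out + self.mlp(self.norm2(sel + attn_out))

        # Weight the block's contribution by the router score so the router receives gradients,
        # then scatter it back; non-selected tokens skip the layer and pass through the
        # residual stream unchanged.
        gate = scores.gather(1, top).unsqueeze(-1)   # (B, k, 1)
        return x + torch.zeros_like(x).scatter(1, idx, gate * delta)

# Usage: with capacity_frac=0.25 and sequence length 16, only 4 tokens per layer
# go through self-attention and the MLP; the rest are routed around the block.
block = MoDBlock(d_model=64, n_heads=4, capacity_frac=0.25)
x = torch.randn(2, 16, 64)
print(block(x).shape)                                # torch.Size([2, 16, 64])
```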
