
Exponentially Faster Language Modelling

November 15, 2023
作者: Peter Belcak, Roger Wattenhofer
cs.AI

Abstract

Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present FastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. FastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
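The numbers in the abstract follow from arranging the layer's neurons as a balanced binary tree: 4095 neurons form a tree of depth 12, and each inference descends one root-to-leaf path, touching exactly 12 neurons. The sketch below illustrates this conditional-execution idea in NumPy; the class name, shapes, and the ReLU choice are illustrative assumptions, not the paper's actual implementation (which the authors release separately).

```python
import numpy as np

class FFFLayer:
    """Illustrative fast feedforward (FFF) layer: 2**depth - 1 neurons
    arranged as a balanced binary tree. Each forward pass evaluates only
    the `depth` neurons on a single root-to-leaf path, branching on the
    sign of each neuron's pre-activation."""

    def __init__(self, d_model, depth=12, rng=None):
        rng = rng or np.random.default_rng(0)
        n_nodes = 2 ** depth - 1  # depth 12 -> 4095 neurons, as in FastBERT
        self.depth = depth
        self.w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)

    def forward(self, x):
        """x: (d_model,) single-token input. Visits exactly `depth` neurons."""
        y = np.zeros_like(x)
        node = 0  # root of the tree, heap-style indexing
        for _ in range(self.depth):
            a = self.w_in[node] @ x                # neuron pre-activation
            y += max(a, 0.0) * self.w_out[node]    # ReLU here for simplicity
            node = 2 * node + (1 if a > 0 else 2)  # branch left/right on sign
        return y
```

A dense feedforward layer of the same width would multiply the input against all 4095 input weight rows; here only 12 dot products are performed per token, which is the source of the claimed speedups once the conditional execution is implemented efficiently.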
PDF · December 15, 2024