LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
April 25, 2024
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
cs.AI
Abstract
We present LayerSkip, an end-to-end solution to speed-up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has less memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations of the draft and verification stages. We run
experiments on different Llama model sizes on different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code
and checkpoints at https://github.com/facebookresearch/LayerSkip.
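To make the training recipe concrete, here is a minimal PyTorch sketch of its two ingredients: layer dropout whose rate grows with depth, and an early-exit loss where every layer's hidden state is decoded through the same shared exit (final norm + LM head). This is an illustrative reconstruction, not the paper's code: the linear dropout schedule, the per-layer loss weights, and the name `max_dropout` are assumptions, and causal masking and token embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerSkipSketch(nn.Module):
    """Toy decoder with (a) depth-scaled layer dropout and (b) an
    early-exit loss where all layers share the final norm + LM head.
    Causal masking and embeddings are omitted for brevity."""

    def __init__(self, num_layers: int, d_model: int, vocab_size: int,
                 max_dropout: float = 0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Layer dropout rate grows with depth: ~0 for the first layer,
        # max_dropout for the last (illustrative linear schedule).
        denom = max(num_layers - 1, 1)
        self.drop_rates = [max_dropout * i / denom for i in range(num_layers)]

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor = None):
        loss = hidden.new_zeros(())
        n = len(self.layers)
        for i, layer in enumerate(self.layers):
            # Layer dropout: stochastically skip this layer during training.
            if self.training and torch.rand(()) < self.drop_rates[i]:
                continue
            hidden = layer(hidden)
            if targets is not None:
                # Early-exit loss: decode this layer's output with the
                # *shared* exit, so early layers learn to predict too.
                logits = self.lm_head(self.norm(hidden))
                weight = (i + 1) / n  # deeper exits weighted more (illustrative)
                loss = loss + weight * F.cross_entropy(
                    logits.flatten(0, 1), targets.flatten())
        return self.lm_head(self.norm(hidden)), loss
```

Because every layer is trained against the same exit, no auxiliary heads are needed at inference time: exiting after layer `i` simply means decoding that layer's hidden state with the shared norm and LM head.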
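The self-speculative decoding loop can likewise be sketched. The same model drafts tokens by exiting at an early layer, then verifies the draft with a single full-depth forward pass, accepting the longest agreeing prefix plus one corrected token. The `num_layers` keyword on the model's forward is a hypothetical hook for running only the first few layers; a real implementation would also reuse the draft stage's KV cache and activations during verification, which is where the shared-compute and memory savings come from.

```python
import torch

@torch.no_grad()
def self_speculative_decode(model, input_ids: torch.Tensor,
                            exit_layer: int, num_draft: int = 4,
                            max_new: int = 64) -> torch.Tensor:
    """Greedy self-speculative decoding sketch (assumes batch size 1).

    Assumes a hypothetical `model(ids, num_layers=k)` that runs only the
    first k transformer layers and decodes through the shared exit, while
    `model(ids)` runs all layers. KV-cache reuse between the draft and
    verify stages is omitted here but is central to the real speedup.
    """
    ids = input_ids
    while ids.shape[-1] - input_ids.shape[-1] < max_new:
        # Draft stage: the shallow sub-model (early exit) proposes tokens.
        draft = ids
        for _ in range(num_draft):
            logits = model(draft, num_layers=exit_layer)
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=-1)
        # Verify stage: one full-depth pass scores all drafted positions.
        full_logits = model(draft[:, :-1])
        verified = full_logits[:, ids.shape[-1] - 1:].argmax(-1)
        proposed = draft[:, ids.shape[-1]:]
        # Accept the longest prefix where draft and full model agree,
        # plus one corrected token taken from the full model's output.
        n_accept = int((verified == proposed).long().cumprod(-1).sum())
        ids = torch.cat([ids, verified[:, : n_accept + 1]], dim=-1)
    return ids
```

In the best case all `num_draft` tokens are accepted and the expensive full-depth pass is amortized over several output tokens; in the worst case the loop still makes progress, since the full model always contributes at least one verified token per iteration.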