LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
April 25, 2024
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
cs.AI
Abstract
We present LayerSkip, an end-to-end solution to speed-up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has less memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations of the draft and verification stages. We run
experiments on different Llama model sizes on different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code
and checkpoints at https://github.com/facebookresearch/LayerSkip.
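To make the training recipe concrete, here is a minimal PyTorch sketch of its two ingredients: layer dropout whose rate grows with depth, and an early-exit loss where every layer's hidden state is decoded through the same shared exit (final norm + LM head). This is an illustrative reconstruction, not the paper's code: the linear dropout schedule, the per-layer loss weights, and the name `max_dropout` are assumptions, and causal masking and token embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerSkipSketch(nn.Module):
    """Toy decoder with (a) depth-scaled layer dropout and (b) an
    early-exit loss where all layers share the final norm + LM head.
    Causal masking and embeddings are omitted for brevity."""

    def __init__(self, num_layers: int, d_model: int, vocab_size: int,
                 max_dropout: float = 0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Layer dropout rate grows with depth: ~0 for the first layer,
        # max_dropout for the last (illustrative linear schedule).
        denom = max(num_layers - 1, 1)
        self.drop_rates = [max_dropout * i / denom for i in range(num_layers)]

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor = None):
        loss = hidden.new_zeros(())
        n = len(self.layers)
        for i, layer in enumerate(self.layers):
            # Layer dropout: stochastically skip this layer during training.
            if self.training and torch.rand(()) < self.drop_rates[i]:
                continue
            hidden = layer(hidden)
            if targets is not None:
                # Early-exit loss: decode this layer's output with the
                # *shared* exit, so early layers learn to predict too.
                logits = self.lm_head(self.norm(hidden))
                weight = (i + 1) / n  # deeper exits weighted more (illustrative)
                loss = loss + weight * F.cross_entropy(
                    logits.flatten(0, 1), targets.flatten())
        return self.lm_head(self.norm(hidden)), loss
```

Because every layer is trained against the same exit, no auxiliary heads are needed at inference time: exiting after layer `i` simply means decoding that layer's hidden state with the shared norm and LM head.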
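The self-speculative decoding loop can likewise be sketched. The same model drafts tokens by exiting at an early layer, then verifies the draft with a single full-depth forward pass, accepting the longest agreeing prefix plus one corrected token. The `num_layers` keyword on the model's forward is a hypothetical hook for running only the first few layers; a real implementation would also reuse the draft stage's KV cache and activations during verification, which is where the shared-compute and memory savings come from.

```python
import torch

@torch.no_grad()
def self_speculative_decode(model, input_ids: torch.Tensor,
                            exit_layer: int, num_draft: int = 4,
                            max_new: int = 64) -> torch.Tensor:
    """Greedy self-speculative decoding sketch (assumes batch size 1).

    Assumes a hypothetical `model(ids, num_layers=k)` that runs only the
    first k transformer layers and decodes through the shared exit, while
    `model(ids)` runs all layers. KV-cache reuse between the draft and
    verify stages is omitted here but is central to the real speedup.
    """
    ids = input_ids
    while ids.shape[-1] - input_ids.shape[-1] < max_new:
        # Draft stage: the shallow sub-model (early exit) proposes tokens.
        draft = ids
        for _ in range(num_draft):
            logits = model(draft, num_layers=exit_layer)
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=-1)
        # Verify stage: one full-depth pass scores all drafted positions.
        full_logits = model(draft[:, :-1])
        verified = full_logits[:, ids.shape[-1] - 1:].argmax(-1)
        proposed = draft[:, ids.shape[-1]:]
        # Accept the longest prefix where draft and full model agree,
        # plus one corrected token taken from the full model's output.
        n_accept = int((verified == proposed).long().cumprod(-1).sum())
        ids = torch.cat([ids, verified[:, : n_accept + 1]], dim=-1)
    return ids
```

In the best case all `num_draft` tokens are accepted and the expensive full-depth pass is amortized over several output tokens; in the worst case the loop still makes progress, since the full model always contributes at least one verified token per iteration.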