LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
April 25, 2024
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
cs.AI
Abstract
We present LayerSkip, an end-to-end solution to speed up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has a smaller memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations between the draft and verification stages. We run
experiments on different Llama model sizes across different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task. We open-source our code
and checkpoints at https://github.com/facebookresearch/LayerSkip.
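The training recipe above pairs an early exit loss with layer dropout whose rate grows with depth, so that shallow sub-networks learn to produce usable predictions. A minimal sketch of such a depth-dependent schedule follows; the linear ramp and the `p_max` parameter are illustrative assumptions, not the paper's exact curriculum.

```python
# Sketch of a depth-dependent layer-dropout schedule: earlier layers get low
# skip probabilities, later layers higher ones. The linear ramp shape and the
# name `p_max` are assumptions for illustration only.

def layer_dropout_rates(num_layers, p_max=0.2):
    """Return a per-layer skip probability that increases with depth."""
    if num_layers == 1:
        return [0.0]
    return [p_max * layer / (num_layers - 1) for layer in range(num_layers)]

rates = layer_dropout_rates(8, p_max=0.2)
# The first layer is never dropped; the last layer is dropped most often.
```

During training, each layer would be skipped with its scheduled probability, while the shared exit head is supervised at every depth via the early exit loss.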
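The self-speculative decoding loop described above can be sketched with a toy deterministic "model": the first few layers draft candidate tokens cheaply, and the full model verifies them, accepting the matching prefix and correcting the first mismatch. Everything here (the arithmetic stand-in for a forward pass, the function names, the parameter defaults) is an illustrative assumption, not the paper's implementation; the real system shares transformer layers and KV cache between the draft and verify stages.

```python
def toy_forward(tokens, num_layers):
    """Stand-in for an LLM forward pass returning a greedy next token.
    Running more layers simply iterates the toy update more times."""
    x = sum(tokens) % 101
    for _ in range(num_layers):
        x = (x * 31 + 7) % 101
    return x

def self_speculative_decode(prompt, full_layers=8, exit_layer=4,
                            draft_len=3, new_tokens=6):
    """Draft with an early exit, verify and correct with the full model."""
    tokens = list(prompt)
    target = len(prompt) + new_tokens
    while len(tokens) < target:
        # Draft phase: exit early at `exit_layer` to cheaply propose tokens.
        ctx = list(tokens)
        draft = []
        for _ in range(draft_len):
            t = toy_forward(ctx, exit_layer)
            draft.append(t)
            ctx.append(t)
        # Verify phase: the full model checks each draft token in order; the
        # accepted prefix is kept and the first mismatch is replaced by the
        # full model's own prediction ("verify and correct").
        ctx = list(tokens)
        for t in draft:
            v = toy_forward(ctx, full_layers)
            ctx.append(v)
            if v != t:
                break
        tokens = ctx
    return tokens[len(prompt):target]

def greedy_decode(prompt, full_layers=8, new_tokens=6):
    """Plain greedy decoding with the full model, for comparison."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(toy_forward(tokens, full_layers))
    return tokens[len(prompt):]
```

By construction, every emitted token is the full model's greedy choice, so the output matches plain greedy decoding; the speedup in the real system comes from accepting several draft tokens per full-model pass and from reusing the draft stage's computation and activations during verification.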