LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
April 25, 2024
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
cs.AI
Abstract
We present LayerSkip, an end-to-end solution to speed up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has a smaller memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations between the draft and verification stages. We run
experiments on different Llama model sizes across different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task. We open-source our code
and checkpoints at https://github.com/facebookresearch/LayerSkip.
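The training recipe above pairs an early exit loss with layer dropout whose rate grows with depth, so that shallow sub-networks learn to produce usable predictions. A minimal sketch of such a depth-dependent schedule follows; the linear ramp and the `p_max` parameter are illustrative assumptions, not the paper's exact curriculum.

```python
# Sketch of a depth-dependent layer-dropout schedule: earlier layers get low
# skip probabilities, later layers higher ones. The linear ramp shape and the
# name `p_max` are assumptions for illustration only.

def layer_dropout_rates(num_layers, p_max=0.2):
    """Return a per-layer skip probability that increases with depth."""
    if num_layers == 1:
        return [0.0]
    return [p_max * layer / (num_layers - 1) for layer in range(num_layers)]

rates = layer_dropout_rates(8, p_max=0.2)
# The first layer is never dropped; the last layer is dropped most often.
```

During training, each layer would be skipped with its scheduled probability, while the shared exit head is supervised at every depth via the early exit loss.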
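The self-speculative decoding loop described above can be sketched with a toy deterministic "model": the first few layers draft candidate tokens cheaply, and the full model verifies them, accepting the matching prefix and correcting the first mismatch. Everything here (the arithmetic stand-in for a forward pass, the function names, the parameter defaults) is an illustrative assumption, not the paper's implementation; the real system shares transformer layers and KV cache between the draft and verify stages.

```python
def toy_forward(tokens, num_layers):
    """Stand-in for an LLM forward pass returning a greedy next token.
    Running more layers simply iterates the toy update more times."""
    x = sum(tokens) % 101
    for _ in range(num_layers):
        x = (x * 31 + 7) % 101
    return x

def self_speculative_decode(prompt, full_layers=8, exit_layer=4,
                            draft_len=3, new_tokens=6):
    """Draft with an early exit, verify and correct with the full model."""
    tokens = list(prompt)
    target = len(prompt) + new_tokens
    while len(tokens) < target:
        # Draft phase: exit early at `exit_layer` to cheaply propose tokens.
        ctx = list(tokens)
        draft = []
        for _ in range(draft_len):
            t = toy_forward(ctx, exit_layer)
            draft.append(t)
            ctx.append(t)
        # Verify phase: the full model checks each draft token in order; the
        # accepted prefix is kept and the first mismatch is replaced by the
        # full model's own prediction ("verify and correct").
        ctx = list(tokens)
        for t in draft:
            v = toy_forward(ctx, full_layers)
            ctx.append(v)
            if v != t:
                break
        tokens = ctx
    return tokens[len(prompt):target]

def greedy_decode(prompt, full_layers=8, new_tokens=6):
    """Plain greedy decoding with the full model, for comparison."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(toy_forward(tokens, full_layers))
    return tokens[len(prompt):]
```

By construction, every emitted token is the full model's greedy choice, so the output matches plain greedy decoding; the speedup in the real system comes from accepting several draft tokens per full-model pass and from reusing the draft stage's computation and activations during verification.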