LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Abstract

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations between the draft and verification stages. We run experiments on different Llama model sizes with different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code and checkpoints at https://github.com/facebookresearch/LayerSkip.
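The draft-then-verify loop described above can be illustrated with a small toy sketch. This is not the released LayerSkip implementation: the "model" below is a hypothetical stack of integer functions, and every name in it (`embed`, `run_layers`, `lm_head`, `exit_layer`, `num_draft`) is an assumption made for illustration. It only shows the control flow: the first `exit_layer` layers draft tokens through the shared output head, the remaining layers resume from the cached early-exit states to verify them, and the first mismatch is replaced by the full model's token. With greedy decoding, the result matches decoding with the full model alone.

```python
from typing import List, Tuple

VOCAB_SIZE = 16   # toy vocabulary (assumption)
NUM_LAYERS = 8    # toy depth (assumption)
MOD = 1009        # size of the toy hidden-state space (assumption)


def embed(tokens: List[int]) -> int:
    """Toy stand-in for embedding a context: a rolling hash of the tokens."""
    h = 0
    for t in tokens:
        h = (h * 31 + t + 1) % MOD
    return h


def run_layers(h: int, start: int, end: int) -> int:
    """Apply toy layers start..end-1 to the hidden state h."""
    for layer in range(start, end):
        if layer < 4:
            h = (h * 7 + layer) % MOD      # early layers always transform h
        elif (h + layer) % 4 == 0:
            h = (h + 1) % MOD              # late layers only sometimes change h
    return h


def lm_head(h: int) -> int:
    """Single output head shared by the early exit and the last layer."""
    return h % VOCAB_SIZE


def greedy_full_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: every token is produced by running all layers."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(lm_head(run_layers(embed(tokens), 0, NUM_LAYERS)))
    return tokens


def self_speculative_decode(
    prompt: List[int], n_new: int, exit_layer: int = 4, num_draft: int = 4
) -> Tuple[List[int], int]:
    """Draft with the first exit_layer layers, verify with the remaining ones."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    rounds = 0
    while len(tokens) < target_len:
        rounds += 1
        # Draft stage: early exit after exit_layer layers, caching each state.
        draft: List[int] = []
        exit_states: List[int] = []
        for _ in range(num_draft):
            h = run_layers(embed(tokens + draft), 0, exit_layer)
            exit_states.append(h)
            draft.append(lm_head(h))
        # Verification stage: resume from the cached states, so the early
        # layers are not recomputed. A cached state stays valid as long as
        # all earlier draft tokens were accepted, which the break guarantees.
        for drafted, h in zip(draft, exit_states):
            verified = lm_head(run_layers(h, exit_layer, NUM_LAYERS))
            tokens.append(verified)        # accepted draft or its correction
            if verified != drafted:
                break                      # discard the rest of the draft
    return tokens[:target_len], rounds
```

In this toy the verification loop runs token by token; in a real transformer the remaining layers would score all drafted positions in one batched forward pass, which is where the speedup comes from. The sharing of compute between the two stages is modelled here only by reusing `exit_states`.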
