Speculative Streaming: Fast LLM Inference without Auxiliary Models
February 16, 2024
Authors: Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
cs.AI
Abstract
Speculative decoding is a prominent technique to speed up the inference of a
large target language model based on predictions of an auxiliary draft model.
While effective, in application-specific settings, it often involves
fine-tuning both draft and target models to achieve high acceptance rates. As
the number of downstream tasks grows, these draft models add significant
complexity to inference systems. We propose Speculative Streaming, a
single-model speculative decoding method that fuses drafting into the target
model by changing the fine-tuning objective from next token prediction to
future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 -
3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and
Meaning Representation, without sacrificing generation quality. Additionally,
Speculative Streaming is parameter-efficient. It achieves on-par/higher
speed-ups than Medusa-style architectures while using ~10000X fewer extra
parameters, making it well-suited for resource-constrained devices.
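
To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of what a "future n-gram" fine-tuning objective can look like: instead of training only on the next token, each position also emits a few speculative streams that are trained on tokens further ahead. The names (`NGramStreams`, `ngram_loss`, `gamma`) and the use of learned per-stream embeddings with a shared output projection are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch, assuming a PyTorch backbone. Not the paper's code:
# per-stream embeddings plus a shared LM head are a simplified stand-in
# for the paper's multi-stream design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NGramStreams(nn.Module):
    """Turn one hidden state per position into 1 + gamma token distributions:
    the usual next-token stream plus gamma speculative look-ahead streams."""

    def __init__(self, d_model: int, vocab_size: int, gamma: int = 3):
        super().__init__()
        self.gamma = gamma
        # Learned per-stream offsets added to the hidden state; the output
        # projection is shared, so the extra parameters are only gamma * d_model.
        self.stream_emb = nn.Parameter(torch.zeros(gamma, d_model))
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model)
        main = self.lm_head(hidden).unsqueeze(2)                    # next-token stream
        spec = self.lm_head(hidden.unsqueeze(2) + self.stream_emb)  # look-ahead streams
        return torch.cat([main, spec], dim=2)  # (batch, seq, 1 + gamma, vocab)


def ngram_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy summed over streams: stream j at position t is trained on
    the token at position t + 1 + j; positions running past the end are dropped."""
    _, seq, streams, vocab = logits.shape
    total = logits.new_zeros(())
    for j in range(streams):
        shift = 1 + j
        if seq <= shift:
            continue
        pred = logits[:, : seq - shift, j, :].reshape(-1, vocab)
        gold = targets[:, shift:].reshape(-1)
        total = total + F.cross_entropy(pred, gold)
    return total


if __name__ == "__main__":
    # Toy usage with random tensors standing in for backbone hidden states.
    d_model, vocab, gamma = 64, 100, 3
    streams = NGramStreams(d_model, vocab, gamma)
    hidden = torch.randn(2, 16, d_model)        # pretend transformer output
    tokens = torch.randint(0, vocab, (2, 16))   # ground-truth continuation
    loss = ngram_loss(streams(hidden), tokens)
    loss.backward()
    print(float(loss))
```

At inference time, the speculative streams would supply draft tokens that the same model verifies on its next forward pass, which is where the reported 1.8-3.1X speed-ups come from; that accept/verify loop is omitted from this sketch.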