

POSS: Position Specialist Generates Better Draft for Speculative Decoding

June 4, 2025
作者: Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
cs.AI

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
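To make the draft-then-verify mechanism concrete, the following is a minimal, self-contained PyTorch sketch of one speculative decoding round with position-specialized draft layers. It is only an illustration of the idea described in the abstract, not the authors' implementation: the toy model sizes, the module names (`ToyTarget`, `PositionSpecialistDrafter`), and the greedy acceptance rule are all assumptions made for this example; refer to the linked codebase for the actual method.

```python
# Illustrative sketch only: toy speculative decoding with position-specialized
# draft heads, loosely following the PosS idea described in the abstract.
# Sizes, module names, and the greedy accept rule are assumptions.
import torch
import torch.nn as nn

VOCAB, HIDDEN, DRAFT_LEN = 100, 32, 4  # toy sizes (assumed)

class ToyTarget(nn.Module):
    """Stand-in for the large target model: returns hidden states and logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return h, self.lm_head(h)

class PositionSpecialistDrafter(nn.Module):
    """One lightweight draft layer per draft position; specialist k only ever
    handles features that have drifted k steps from the target's own."""
    def __init__(self):
        super().__init__()
        self.specialists = nn.ModuleList(
            nn.Linear(HIDDEN, HIDDEN) for _ in range(DRAFT_LEN))
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def draft(self, feat):
        """Autoregressively propose DRAFT_LEN tokens from one target feature."""
        tokens = []
        for k, layer in enumerate(self.specialists):
            feat = torch.tanh(layer(feat))        # position-k specialist
            tok = self.lm_head(feat).argmax(-1)   # greedy draft token
            tokens.append(tok)
            feat = feat + self.embed(tok)         # feed drafted token back in
        return torch.stack(tokens, dim=-1)        # (batch, DRAFT_LEN)

@torch.no_grad()
def speculative_step(target, drafter, ids):
    """One draft-then-verify round; returns ids extended by accepted tokens."""
    hidden, _ = target(ids)
    draft = drafter.draft(hidden[:, -1])          # propose DRAFT_LEN tokens
    candidate = torch.cat([ids, draft], dim=1)
    _, logits = target(candidate)                 # single parallel verify pass
    # Greedy verification: accept draft tokens while they match the target's
    # own argmax prediction at the same position, then stop.
    preds = logits[:, ids.size(1) - 1:-1].argmax(-1)
    accepted = ids
    for k in range(DRAFT_LEN):
        if (draft[:, k] != preds[:, k]).any():
            break
        accepted = torch.cat([accepted, draft[:, k:k + 1]], dim=1)
    return accepted

if __name__ == "__main__":
    torch.manual_seed(0)
    target, drafter = ToyTarget(), PositionSpecialistDrafter()
    ids = torch.randint(0, VOCAB, (1, 5))
    print("accepted length:", speculative_step(target, drafter, ids).size(1))
```

The key design point the sketch tries to convey is that each draft position is served by its own small layer, so the specialist at position k is only ever trained and used on features that have accumulated k steps of draft-side deviation, rather than one shared draft layer absorbing all deviation levels.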