PosS: Position Specialist Generates Better Draft for Speculative Decoding

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens and a large target model to verify them in parallel. Recent studies leverage the hidden state of the target model to improve the draft model's prediction accuracy. However, existing methods suffer from degrading draft token quality at later positions, caused by error accumulation in the features generated by the draft model. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers, each generating tokens at its assigned position(s). Position specialists greatly improve the token acceptance rate at later positions in each drafting round, as each specialist only needs to handle a certain level of draft-model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS improves over baselines in average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
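The draft-then-verify loop with position-assigned drafters can be illustrated with a toy sketch. This is not the authors' implementation: it works on token IDs rather than hidden-state features, the names `draft_round`, `verify`, `toy_target`, `good_draft`, and `bad_draft` are hypothetical, the rule that positions beyond the last specialist reuse the last one is an assumption, and the target model is queried sequentially here whereas a real system verifies all drafted tokens in one parallel pass.

```python
from typing import Callable, List, Sequence

Token = int
# A draft specialist maps the current token context to one proposed next token.
Specialist = Callable[[Sequence[Token]], Token]
# The target model maps a context to its own greedy next token.
Target = Callable[[Sequence[Token]], Token]


def draft_round(context: List[Token], specialists: List[Specialist], depth: int) -> List[Token]:
    """Propose `depth` tokens; draft position i uses specialist min(i, last index).

    The position-to-specialist mapping is an assumption made for this sketch.
    """
    drafted: List[Token] = []
    for pos in range(depth):
        specialist = specialists[min(pos, len(specialists) - 1)]
        drafted.append(specialist(context + drafted))
    return drafted


def verify(context: List[Token], drafted: List[Token], target: Target) -> List[Token]:
    """Greedy verification: keep the longest prefix the target agrees with.

    On the first mismatch the target's token replaces the draft and the round ends;
    if every draft is accepted, the target contributes one extra token.
    """
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10.
def toy_target(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def good_draft(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # agrees with the target


def bad_draft(ctx: Sequence[Token]) -> Token:
    return 0  # stands in for a drafter that has drifted from the target


if __name__ == "__main__":
    ctx = [1]
    drafted = draft_round(ctx, [good_draft, good_draft, bad_draft], depth=3)
    print(drafted, verify(ctx, drafted, toy_target))
```

In this toy setting the tokens produced per round shrink as soon as a later-position drafter disagrees with the target, which is the effect the position-specialized draft layers are meant to reduce.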
