Speculative Streaming: Fast LLM Inference without Auxiliary Models

February 16, 2024
Authors: Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
cs.AI

Abstract

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8-3.1X on a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient: it achieves on-par or higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.
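The abstract compresses the mechanism, so a rough sense of the draft-then-verify loop may help. The sketch below is a minimal toy, not the paper's implementation: `toy_forward`, the hash-based recurrence, `VOCAB`, the per-position drafts, and the 20% corruption rate are all invented stand-ins for a model fine-tuned with a future n-gram objective. It only illustrates the core idea of single-model speculative decoding, where one forward pass simultaneously verifies the previous draft and emits the next one.

```python
import random
from itertools import accumulate
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size

def _next_token(prefix_sum: int) -> int:
    """Deterministic toy 'language model': the next token is a hash of the prefix sum."""
    return (prefix_sum * 31 + 7) % VOCAB

def toy_forward(tokens: List[int], k: int, rng: random.Random) -> Tuple[List[int], List[List[int]]]:
    """Stand-in for one forward pass of the fine-tuned target model.

    Returns (a) a greedy next-token prediction at every input position, so
    drafted tokens can be verified within the same pass, and (b) at every
    position, a k-token draft continuing past that position's next-token
    prediction, loosely mimicking per-position speculative streams. Drafts
    roll the same recurrence but are corrupted 20% of the time to model an
    imperfect draft head.
    """
    sums = list(accumulate(tokens))
    preds = [_next_token(s) for s in sums]
    drafts = []
    for s, p in zip(sums, preds):
        d, cur = [], s + p
        for _ in range(k):
            t = _next_token(cur)
            if rng.random() < 0.2:  # imperfect drafting
                t = (t + 1) % VOCAB
            d.append(t)
            cur += t
        drafts.append(d)
    return preds, drafts

def speculative_generate(prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Single-model draft-then-verify loop: each forward pass both verifies
    the previous draft and produces the next one."""
    rng = random.Random(0)
    tokens, draft = list(prompt), []
    produced = passes = 0
    while produced < max_new_tokens:
        preds, drafts = toy_forward(tokens + draft, k, rng)
        passes += 1
        base = len(tokens) - 1  # preds[base] follows the verified prefix
        accepted = 0
        while accepted < len(draft) and draft[accepted] == preds[base + accepted]:
            accepted += 1
        # Commit the accepted draft tokens plus one token verified "for free",
        # then pick up the draft emitted at the last committed position.
        tokens = tokens + draft[:accepted] + [preds[base + accepted]]
        draft = drafts[base + accepted]
        produced += accepted + 1
    print(f"generated {produced} tokens in {passes} forward passes")
    return tokens[: len(prompt) + max_new_tokens]

print(speculative_generate([1, 2, 3], max_new_tokens=20))
```

Because verification and drafting share a single forward pass, each pass commits between one and k+1 tokens, which is where the reported 1.8-3.1X decoding speedup comes from. The actual method fuses drafting into the target model's own computation rather than looping in Python as this toy does.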
