
Tandem Transformers for Inference Efficient LLMs

February 13, 2024
作者: Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
cs.AI

Abstract

The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
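The abstract describes two mechanisms: a small autoregressive drafter that cross-attends to block-mode representations from a large model, and a SPEED-style loop in which the large model verifies the drafted tokens. The sketch below illustrates that data flow only. The module names (`SmallDrafter`, `LargeVerifier`), the tiny stand-in layers, the greedy accept rule, and all dimensions are assumptions for illustration, not the paper's actual PaLM2-based implementation.

```python
# Minimal, runnable sketch of tandem drafting + SPEED-style verification.
# All sizes, module names, and the greedy accept rule are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, D_SMALL, D_LARGE, BLOCK = 100, 32, 64, 4

class LargeVerifier(nn.Module):
    """Stand-in for the large model run in block mode: it encodes a whole
    block of tokens in one pass and returns per-token hidden states and logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_LARGE)
        self.body = nn.TransformerEncoderLayer(D_LARGE, nhead=4, batch_first=True)
        self.head = nn.Linear(D_LARGE, VOCAB)

    def forward(self, tokens):                                  # tokens: (1, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.body(self.emb(tokens), src_mask=mask)          # causal block pass
        return h, self.head(h)                                  # hidden states + logits

class SmallDrafter(nn.Module):
    """Stand-in for the small autoregressive model; its key feature here is
    cross-attention into the large model's richer prefix representations."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_SMALL)
        self.proj = nn.Linear(D_LARGE, D_SMALL)                 # map large states to small width
        self.xattn = nn.MultiheadAttention(D_SMALL, num_heads=4, batch_first=True)
        self.head = nn.Linear(D_SMALL, VOCAB)

    def forward(self, tokens, large_h):                         # tokens: (1, t)
        q = self.emb(tokens)
        mem = self.proj(large_h)
        attn_out, _ = self.xattn(q, mem, mem)                   # attend to large-model representations
        return self.head(q + attn_out)                          # (1, t, VOCAB)

def tandem_speed_step(large, small, prefix):
    """One SPEED iteration: the small model drafts BLOCK tokens, then the large
    model re-scores them in a single block pass and the agreeing prefix is kept.
    (A full implementation would also append the large model's own next token.)"""
    large_h, _ = large(prefix)
    draft = prefix
    for _ in range(BLOCK):                                      # small model drafts autoregressively
        logits = small(draft, large_h)
        nxt = logits[:, -1:].argmax(-1)
        draft = torch.cat([draft, nxt], dim=1)
    _, verify_logits = large(draft)                             # block-mode verification
    preds = verify_logits[:, prefix.size(1) - 1:-1].argmax(-1)  # large model's own predictions
    drafted = draft[:, prefix.size(1):]
    agree = (preds == drafted).long().cumprod(dim=1)            # longest agreeing prefix
    n_accept = int(agree.sum())
    return torch.cat([prefix, drafted[:, :n_accept]], dim=1)

if __name__ == "__main__":
    torch.manual_seed(0)
    large, small = LargeVerifier(), SmallDrafter()
    seq = torch.randint(0, VOCAB, (1, 8))
    seq = tandem_speed_step(large, small, seq)
    print("sequence length after one step:", seq.size(1))
```

The efficiency argument follows from this structure: the large model is only invoked once per block (for prefix encoding and verification), while the cheap small model handles the strictly sequential per-token work, and its access to the large model's representations is what recovers the accuracy a standalone small drafter would lose.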
