Tandem Transformers for Inference Efficient LLMs
February 13, 2024
Authors: Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
cs.AI
Abstract
The autoregressive nature of conventional large language models (LLMs)
inherently limits inference speed, as tokens are generated sequentially. While
speculative and parallel decoding techniques attempt to mitigate this, they
face limitations: either relying on less accurate smaller models for generation
or failing to fully leverage the base LLM's representations.
We introduce a novel architecture, Tandem transformers, to address these
issues. This architecture uniquely combines (1) a small autoregressive model
and (2) a large model operating in block mode (processing multiple tokens
simultaneously). The small model's predictive accuracy is substantially
enhanced by granting it attention to the large model's richer representations.
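The core mechanism described above is the small model attending to the large model's block-level representations. A minimal sketch of one such cross-attention step, with all projection matrices, names, and dimensions invented here for illustration (not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(small_h, large_reps, Wq, Wk, Wv):
    """One cross-attention step: the small model's hidden state queries
    the large model's (richer) representations for a processed block.

    small_h:    (d_small,)           small model's current hidden state
    large_reps: (block_len, d_large) large model's block representations
    """
    q = small_h @ Wq            # query from the small model's state
    k = large_reps @ Wk         # keys from the large model's representations
    v = large_reps @ Wv         # values from the large model's representations
    weights = softmax(k @ q / np.sqrt(q.shape[-1]))
    return weights @ v          # context vector fed back into the small model
```

The returned context vector would be combined with the small model's own state before its next-token prediction, which is how the tandem lets the cheap autoregressive model benefit from the expensive model's representations.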
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko
demonstrates a 3.3% improvement in next-token prediction accuracy over a
standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter
model with comparable downstream performance. We further incorporate the tandem
model within the speculative decoding (SPEED) framework where the large model
validates tokens from the small model. This ensures that the Tandem of
PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster
than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream
task accuracy.
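In the SPEED setup above, the small model drafts a block of tokens autoregressively and the large model verifies the whole block in one parallel pass, accepting the longest matching prefix. A toy greedy version of that draft-and-verify loop; the `draft_model`/`verify_model` callables are simplified stand-ins, not the paper's actual interfaces:

```python
def speculative_decode(prompt, draft_model, verify_model, k=4, max_len=20):
    """Greedy speculative decoding sketch.

    draft_model(seq)  -> next token given seq (small, autoregressive)
    verify_model(seq) -> list of next-token predictions for every position
                         of seq, computed in one block-mode pass (large model)
    """
    seq = list(prompt)
    while len(seq) < max_len:
        # 1) small model drafts k tokens, one at a time
        draft = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # 2) large model scores the whole draft block in parallel;
        #    targets[i] is its prediction for the token after position i
        targets = verify_model(seq + draft)
        # 3) accept the longest draft prefix the large model agrees with
        n_accept = 0
        for i, t in enumerate(draft):
            if t == targets[len(seq) + i - 1]:
                n_accept += 1
            else:
                break
        seq += draft[:n_accept]
        # 4) take one token from the large model (correction, and
        #    guarantees progress even when nothing was accepted)
        seq.append(targets[len(seq) - 1])
    return seq[:max_len]
```

Because accepted tokens match what the large model would have produced greedily, the output is identical to decoding with the large model alone, which is how the tandem keeps downstream task accuracy unchanged while gaining speed.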