Tandem Transformers for Inference Efficient LLMs

Abstract

Conventional large language models (LLMs) are autoregressive: tokens are generated one at a time, which inherently limits inference speed. Speculative and parallel decoding techniques attempt to mitigate this, but each has a limitation: they either rely on a less accurate smaller model for generation or fail to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. It combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by letting it attend to the large model's richer representations. On the PaLM2 pretraining dataset, a Tandem of PaLM2-Bison and PaLM2-Gecko improves next-token prediction accuracy by 3.3% over a standalone PaLM2-Gecko and offers a 1.16x speedup over a PaLM2-Otter model with comparable downstream performance. We further incorporate the Tandem model into the speculative decoding (SPEED) framework, where the large model validates tokens from the small model. As a result, the Tandem of PaLM2-Bison and PaLM2-Gecko achieves a substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
