ConFu: Contemplate the Future for Better Speculative Sampling

Abstract

Speculative decoding has emerged as a powerful approach to accelerating large language model (LLM) inference: a lightweight draft model proposes candidate tokens, which the target model then verifies. The effectiveness of this paradigm depends critically on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedups, existing draft models remain limited by error accumulation: they condition only on the current prefix, so their predictions drift further from the target model's with each drafting step. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE that enables context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8 to 11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
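The draft-then-verify loop that the abstract builds on can be illustrated with a minimal sketch. This is generic greedy speculative decoding with toy next-token functions, not ConFu or EAGLE: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and a real system would verify all drafted tokens in a single batched forward pass of the target model rather than one call per position.

```python
from typing import Callable, List, Tuple

Token = int
# Assumed interface: a greedy "model" maps a token context to its next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> Tuple[List[Token], int]:
    """Draft k tokens, then keep the longest run the target agrees with.

    Returns (tokens_to_append, number_of_drafted_tokens_accepted).
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: compare each drafted token with the target's greedy choice.
    out: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: emit the target's token, discard the rest.
            out.append(expected)
            return out, len(out) - 1
        out.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target contributes one bonus token.
    out.append(target_next(ctx))
    return out, k


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        new_tokens, _ = speculative_step(out, draft_next, target_next, k)
        out.extend(new_tokens)
    return out[:len(prefix) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    # Toy target: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy draft: agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, k=4))
    print(generate([1], toy_draft, toy_target, k=4, max_new=12))
```

In this sketch the output always matches what the target alone would produce greedily; the draft only affects how many tokens each step yields. A single mismatch discards every later drafted token in that step, which is why drift in the draft model's predictions directly lowers the acceptance rate.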
