Direct Multi-Token Decoding

Abstract

Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may carry enough information to generate multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model already shows promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to improve further with larger training datasets.
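
To make the idea of skipping the early and middle layers concrete, the following is a minimal sketch of a layer-pass count, not the authors' implementation. It assumes a fixed cycle length (a hypothetical parameter, cycle_len) in which the first token of each cycle traverses all layers and the remaining tokens use only the late layers. The function names and the layer split used in the usage note are illustrative assumptions, and the count ignores KV-cache handling, memory traffic, and batching, so it is not a wall-clock speedup estimate.

```python
import math


def layer_passes_standard(num_tokens: int, num_shared: int, num_late: int) -> int:
    """Layer passes when every generated token traverses all layers.

    num_shared: number of early + middle layers (assumed split, hypothetical).
    num_late:   number of late layers (assumed split, hypothetical).
    """
    return num_tokens * (num_shared + num_late)


def layer_passes_dmtd(num_tokens: int, num_shared: int, num_late: int,
                      cycle_len: int) -> int:
    """Layer passes under the assumed cycle schedule.

    Early/middle layers run once per cycle; late layers run once per token.
    """
    if cycle_len < 1:
        raise ValueError("cycle_len must be at least 1")
    num_cycles = math.ceil(num_tokens / cycle_len)
    return num_cycles * num_shared + num_tokens * num_late


def dmtd_schedule(num_tokens: int, cycle_len: int) -> list[str]:
    """Label each token position as a full pass or a late-layers-only pass."""
    if cycle_len < 1:
        raise ValueError("cycle_len must be at least 1")
    return ["full" if i % cycle_len == 0 else "late-only"
            for i in range(num_tokens)]
```

For example, with a hypothetical split of 28 early/middle layers and 8 late layers, generating 8 tokens takes 288 layer passes in the standard scheme and 120 with cycle_len=4 under this simplified count. With cycle_len=1 the two counts coincide, which matches ordinary decoding.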