VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Abstract

Speculative decoding (SD) reduces the high inference cost of LLMs by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision for each draft token: accept it, or recompute it with the full verifier. Yet we find that many rejected tokens can be verified correctly by a slim submodel, derived from the full verifier via intra-model routing, without invoking the full verifier at all. This motivates a slim verifier that handles tokens requiring only moderate verification effort, reducing expensive calls to the large model. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built around a routed slim verifier. Draft tokens are processed hierarchically: high-confidence tokens are accepted directly, medium-confidence tokens are regenerated by the slim verifier, and uncertain tokens are verified by the full model. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks and requires no changes to their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
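
The three-tier handling of draft tokens described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how confidence is computed or where the tier boundaries lie, so the thresholds `high` and `low`, the names `route` and `verify_draft`, and the callables `slim_verify` and `full_verify` are all hypothetical. The sketch also assumes the usual SD convention that draft tokens after the first corrected token are discarded.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Tier(Enum):
    ACCEPT = "accept"  # high confidence: keep the draft token as-is
    SLIM = "slim"      # medium confidence: regenerate with the slim verifier
    FULL = "full"      # low confidence: verify with the full model


def route(confidence: float, high: float = 0.9, low: float = 0.5) -> Tier:
    """Map a draft-token confidence to a tier.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if confidence >= high:
        return Tier.ACCEPT
    if confidence >= low:
        return Tier.SLIM
    return Tier.FULL


def verify_draft(
    tokens: Sequence[str],
    confidences: Sequence[float],
    slim_verify: Callable[[str], str],
    full_verify: Callable[[str], str],
    high: float = 0.9,
    low: float = 0.5,
) -> List[Tuple[str, Tier]]:
    """Process draft tokens tier by tier.

    slim_verify and full_verify are hypothetical stand-ins that take a draft
    token and return the token the respective verifier would emit.
    """
    output: List[Tuple[str, Tier]] = []
    for draft, conf in zip(tokens, confidences):
        tier = route(conf, high, low)
        if tier is Tier.ACCEPT:
            verified = draft
        elif tier is Tier.SLIM:
            verified = slim_verify(draft)
        else:
            verified = full_verify(draft)
        output.append((verified, tier))
        if verified != draft:
            # Assumed standard SD behaviour: once a draft token is corrected,
            # the remaining draft tokens no longer match the prefix.
            break
    return output


if __name__ == "__main__":
    result = verify_draft(
        ["the", "cat", "sat"],
        [0.95, 0.70, 0.20],
        slim_verify=lambda t: t,          # toy verifier: agrees with the draft
        full_verify=lambda t: t.upper(),  # toy verifier: always corrects
    )
    for token, tier in result:
        print(token, tier.value)
```

In this toy run, only the last token reaches the full verifier. Sending fewer tokens to the large model is the effect the multi-tier design aims for.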