S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to a 4.7× speedup over autoregressive decoding and up to a 1.57× speedup over a tuned dynamic decoding baseline, while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
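To make the drafter/verifier idea concrete, the following is a minimal sketch of greedy speculative verification, assuming a simple exact-match acceptance rule. It is not taken from S2D2: the function names (`verify_draft`, `toy_next_token`) and the toy model are hypothetical, and the abstract does not specify the actual acceptance rule or the routing policies. The draft stands in for tokens proposed in parallel by diffusion, and `next_token` stands in for the same model run with block size one.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a token prefix, return the greedy next token.
# In the setting described above this would be the block-size-1 (autoregressive) mode.
NextTokenFn = Callable[[Sequence[int]], int]


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: NextTokenFn,
) -> List[int]:
    """Keep the longest draft prefix the verifier agrees with.

    At the first disagreement, the verifier's own token is appended instead
    and the remaining draft tokens are discarded. Assumption: exact-match
    greedy acceptance, which is only one possible acceptance rule.
    """
    accepted: List[int] = []
    for proposed in draft:
        expected = next_token(list(prefix) + accepted)
        if proposed != expected:
            accepted.append(expected)  # verifier's correction replaces the draft token
            break
        accepted.append(proposed)
    return accepted


def toy_next_token(context: Sequence[int]) -> int:
    # Toy stand-in for a model: the next token is the last token plus one, modulo 10.
    return (context[-1] + 1) % 10


# The draft agrees with the verifier for two tokens, then diverges at the third.
print(verify_draft([1, 2], [3, 4, 9, 6], toy_next_token))  # [3, 4, 5]
```

The loop calls the verifier once per position for readability. A practical verifier would typically score all draft positions in a single forward pass, which is where the speedup from speculation comes from.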
