Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Abstract

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework for improving the efficiency of inference in large language models (LLMs). RSD combines a lightweight draft model with a more powerful target model and incorporates a controlled bias that prioritizes high-reward outputs, in contrast to existing speculative decoding methods, which enforce strict unbiasedness. RSD uses a process reward model to evaluate intermediate decoding steps and to decide dynamically whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We show theoretically that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model alone (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
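The threshold-based decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rsd_decode`, the callable signatures, the stop marker, and the toy stand-ins for the draft model, target model, and process reward model are all hypothetical. The sketch also assumes that a draft step is kept when its reward meets the threshold and is otherwise regenerated by the target model, which is one plausible reading of the abstract.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   a step function maps the steps generated so far to the next reasoning step,
#   a reward function scores a candidate step given the steps so far.
StepFn = Callable[[List[str]], str]
RewardFn = Callable[[List[str], str], float]


def rsd_decode(
    draft_step: StepFn,
    target_step: StepFn,
    reward: RewardFn,
    threshold: float,
    max_steps: int,
    stop: str = "<eos>",  # hypothetical end-of-generation marker
) -> Tuple[List[str], int]:
    """Generate steps with the draft model, falling back to the target model
    whenever the reward of the draft step is below the threshold."""
    steps: List[str] = []
    target_calls = 0
    for _ in range(max_steps):
        candidate = draft_step(steps)
        if reward(steps, candidate) < threshold:
            # Low-reward draft step: discard it and invoke the target model.
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
        if candidate == stop:
            break
    return steps, target_calls


# Toy stand-ins so the sketch runs without any real models.
def toy_draft(prefix: List[str]) -> str:
    return f"draft-{len(prefix)}"


def toy_target(prefix: List[str]) -> str:
    return f"target-{len(prefix)}"


def toy_reward(prefix: List[str], step: str) -> float:
    # Pretend the draft model is reliable only on even-numbered steps.
    return 0.9 if len(prefix) % 2 == 0 else 0.2


steps, calls = rsd_decode(toy_draft, toy_target, toy_reward, threshold=0.5, max_steps=4)
print(steps, calls)
```

In this toy run the target model is called on only half of the steps. Raising the threshold would route more steps to the target model, at higher cost, and lowering it would rely more on the draft model.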
