RelayGen: Intra-Generation Model Switching for Efficient Reasoning

Abstract

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present RelayGen, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
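The two ideas named in the abstract, token probability margins and cue-triggered segment-level switching, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The margin is assumed here to be the gap between the top-1 and top-2 token probabilities, and the rule for handing control back to the large model (a blank line ending the segment) is an assumption the abstract does not specify. The names `probability_margin`, `relay_generate`, `large_step`, `small_step`, and `segment_end` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_margin(logits):
    """Assumed definition: top-1 minus top-2 token probability.

    A large margin suggests a confident (low-difficulty) token;
    a small margin suggests uncertainty.
    """
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def relay_generate(large_step, small_step, prompt, switch_cues,
                   segment_end="\n\n", eos="<eos>", max_tokens=64):
    """Generate token by token, relaying between two models.

    large_step / small_step: hypothetical callables mapping the text so far
    to the next token string. When the text ends with a switch cue, the
    continuation is delegated to the small model. Returning to the large
    model at `segment_end` is an assumption made for this sketch.
    """
    text = prompt
    trace = []  # records which model produced each token
    use_small = False
    for _ in range(max_tokens):
        step = small_step if use_small else large_step
        token = step(text)
        if token == eos:
            break
        text += token
        trace.append("small" if use_small else "large")
        if use_small:
            # Assumed hand-back rule: the low-difficulty segment has ended.
            if text.endswith(segment_end):
                use_small = False
        elif any(text.endswith(cue) for cue in switch_cues):
            # A switch cue signals a transition to a lower-difficulty segment.
            use_small = True
    return text, trace
```

In this sketch the cues are plain strings checked against the end of the generated text, so no learned router is involved. That matches the training-free framing in the abstract, though the actual cue selection procedure is not described there.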
