RelayGen: Intra-Generation Model Switching for Efficient Reasoning
February 6, 2026
Authors: Jiwon Song, Yoongon Kim, Jae-Joon Kim
cs.AI
Abstract
Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present RelayGen, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
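The core signal described above — a token probability margin as a proxy for generation difficulty, aggregated at segment granularity — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the margin here is assumed to be the gap between the top-two next-token probabilities, and the function names and the 0.5 threshold are illustrative placeholders.

```python
import math

def token_margin(logits):
    """Probability margin of one decoding step: difference between the
    top-2 softmax probabilities. A large margin suggests a low-difficulty
    (confident) token; a small margin suggests a hard decision point."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = sorted((e / z for e in exps), reverse=True)
    return probs[0] - probs[1]

def should_switch(segment_logits, margin_threshold=0.5):
    """Segment-level switch decision (illustrative): delegate the
    continuation to the smaller model when the mean margin over a
    completed segment is high, i.e. generation looks easy.
    The threshold value is an assumption, not from the paper."""
    mean_margin = sum(token_margin(l) for l in segment_logits) / len(segment_logits)
    return mean_margin >= margin_threshold

# A confidently generated segment (one dominant logit per step) yields a
# high mean margin, so the continuation would be handed to the small model;
# a near-uniform segment keeps generation on the large model.
easy_segment = [[10.0, 0.0, 0.0, 0.0]] * 4
hard_segment = [[0.1, 0.0, 0.05, 0.02]] * 4
```

In the paper's framing, this decision is made at coarse segment boundaries (and via model-specific switch cues) rather than per token, which is what keeps the system complexity low compared with learned token-level routers.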