

RelayGen: Intra-Generation Model Switching for Efficient Reasoning

February 6, 2026
Authors: Jiwon Song, Yoongon Kim, Jae-Joon Kim
cs.AI

Abstract

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present RelayGen, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
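The two ingredients the abstract names, token-probability margins as an uncertainty proxy and coarse segment-level handoff, can be sketched in a few lines. The following is a minimal illustration only: the function names, the 0.5 threshold, and the `large_gen`/`small_gen` callables are hypothetical stand-ins, not the authors' implementation.

```python
import torch

def token_margins(logits: torch.Tensor) -> torch.Tensor:
    """Per-step uncertainty proxy: gap between the top-2 token probabilities.

    logits: (num_steps, vocab_size) decoder outputs for one generation.
    A small margin suggests the model was uncertain at that step.
    """
    probs = torch.softmax(logits.float(), dim=-1)
    top2 = probs.topk(2, dim=-1).values          # (num_steps, 2)
    return top2[..., 0] - top2[..., 1]

def segment_is_easy(segment_logits: torch.Tensor, threshold: float = 0.5) -> bool:
    """Coarse segment-level decision (threshold is an illustrative value):
    treat a segment as low-difficulty when its mean token margin is high,
    i.e. generation within the segment was mostly confident."""
    return token_margins(segment_logits).mean().item() > threshold

def relay_generate(large_gen, small_gen, prompt: str, max_segments: int = 32) -> str:
    """Relay-loop sketch. `large_gen`/`small_gen` are hypothetical callables
    that extend the text by one segment and return (text, easy_next, done),
    where `easy_next` would be derived from a model-specific switch cue."""
    text, use_small = prompt, False
    for _ in range(max_segments):
        step = small_gen if use_small else large_gen
        text, easy_next, done = step(text)
        if done:
            break
        use_small = easy_next  # hand easy segments to the small model, hard ones back
    return text
```

Keeping the decision at segment granularity avoids per-token routing machinery, consistent with the abstract's claim that coarse-grained control suffices to capture difficulty transitions and that no trained routing component is needed.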