RelayLLM: Efficient Reasoning via Collaborative Decoding
January 8, 2026
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
cs.AI
Abstract
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational cost and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity, offloading entire queries to LLMs and wasting significant computation when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of a warm-up stage followed by Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved while invoking the LLM for only 1.07% of the total generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
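To make the relay mechanism concrete, below is a minimal Python sketch of token-level collaborative decoding as the abstract describes it. The control-token name CALL_TOKEN and the slm_step/llm_step interfaces are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of token-level relay decoding: the SLM drives generation and
# hands off individual "critical" tokens to the LLM when it emits a special
# command token. All names here are hypothetical placeholders.

CALL_TOKEN = "<CALL>"  # assumed special command the SLM emits to invoke the LLM
EOS_TOKEN = "<EOS>"

def relay_decode(slm_step, llm_step, prompt, max_tokens=1024):
    """Generate with the SLM as controller, relaying critical tokens to the LLM.

    slm_step / llm_step: callables mapping the current token list to one next token.
    Returns the generated tokens and the fraction of tokens produced by the LLM.
    """
    tokens, llm_tokens = [prompt], 0
    for _ in range(max_tokens):
        tok = slm_step(tokens)           # the SLM generates by default
        if tok == CALL_TOKEN:
            tok = llm_step(tokens)       # critical step: defer one token to the LLM
            llm_tokens += 1
        tokens.append(tok)
        if tok == EOS_TOKEN:
            break
    generated = tokens[1:]
    return generated, llm_tokens / max(len(generated), 1)

if __name__ == "__main__":
    # Toy stand-ins: the "SLM" asks for help once, the "LLM" supplies one token.
    script = iter(["The", "answer", CALL_TOKEN, "is", "42.", EOS_TOKEN])
    slm = lambda ctx: next(script)
    llm = lambda ctx: "probably"
    print(relay_decode(slm, llm, "Q: ..."))
```

Under this reading, the fraction returned by relay_decode corresponds to the invocation rate the abstract reports, i.e. the LLM producing only 1.07% of generated tokens.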