RelayLLM: Efficient Reasoning via Collaborative Decoding
January 8, 2026
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
cs.AI
Abstract
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routing approaches, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of a warm-up stage followed by Group Relative Policy Optimization (GRPO), that teaches the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
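
To make the decoding scheme concrete, below is a minimal Python sketch of the token-level relay loop described in the abstract. The token strings (`<CALL_LLM>`, `<EOS>`), the stub "models", and the function names are illustrative assumptions rather than the paper's actual interface: the SLM drives generation and emits a special command token whenever it wants the LLM to produce the next, critical token.

```python
# Minimal sketch of token-level relay decoding (assumed interface, not the
# paper's actual API). The SLM generates tokens until it emits a special
# command token, at which point the LLM supplies one token and control
# returns to the SLM.

RELAY = "<CALL_LLM>"  # hypothetical special command token
EOS = "<EOS>"

# Toy stand-ins for the two models: each returns one token per call.
slm_script = iter(["Step", "1:", RELAY, "factor", "the", "expression.", EOS])
llm_script = iter(["carefully"])

def slm_next_token(context):
    return next(slm_script)

def llm_next_token(context):
    return next(llm_script)

def relay_decode(prompt, max_tokens=64):
    tokens, llm_calls = [], 0
    for _ in range(max_tokens):
        tok = slm_next_token([prompt] + tokens)
        if tok == RELAY:
            # The SLM "relays" control: the LLM generates the critical
            # token, then the SLM resumes on the next step.
            tok = llm_next_token([prompt] + tokens)
            llm_calls += 1
        if tok == EOS:
            break
        tokens.append(tok)
    return " ".join(tokens), llm_calls

text, calls = relay_decode("Solve: x^2 - 1 = 0")
print(text)                   # Step 1: carefully factor the expression.
print(f"LLM calls: {calls}")  # 1
```

In this toy run, only 1 of 6 generated tokens comes from the LLM, which mirrors the paper's claim that most reasoning steps stay on the SLM; the warm-up and GRPO training that teach the SLM when to emit the command token are not modeled here.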