R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
May 27, 2025
Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
cs.AI
Abstract
Large Language Models (LLMs) achieve impressive reasoning capabilities at the
cost of substantial inference overhead, posing significant deployment
challenges. Although distilled Small Language Models (SLMs) significantly
enhance efficiency, their performance suffers as they fail to follow LLMs'
reasoning paths. Fortunately, we find that only a small fraction of tokens
genuinely causes the reasoning paths of LLMs and SLMs to diverge. Most generated tokens
are either identical or exhibit neutral differences, such as minor variations
in abbreviations or expressions. Leveraging this insight, we introduce **Roads
to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs
only for these critical, path-divergent tokens, while leaving the majority of
token generation to the SLM. We also develop an automatic data generation
pipeline that identifies divergent tokens and generates token-level routing
labels to train the lightweight router. We apply R2R to combine R1-1.5B and
R1-32B models from the DeepSeek family, and evaluate on challenging math,
coding, and QA benchmarks. With an average activated parameter size of 5.6B,
R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the
R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with
comparable performance, advancing the Pareto frontier of test-time scaling
efficiency. Our code is available at https://github.com/thu-nics/R2R.
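
The routing scheme described in the abstract suggests a simple decode-time loop: the SLM proposes each next token, and a lightweight router decides whether that particular step should instead be taken by the LLM. Below is a minimal Python sketch of such a loop; the interfaces `slm_next_token`, `llm_next_token`, and `router_is_divergent` are hypothetical placeholders for illustration, not the actual API of the R2R repository.

```python
# Minimal sketch of token-level small-large routing at decode time.
# All callables are hypothetical placeholders standing in for real models.
from typing import Callable, List


def route_and_decode(
    prompt_ids: List[int],
    slm_next_token: Callable[[List[int]], int],       # small model proposes a token
    llm_next_token: Callable[[List[int]], int],       # large model, invoked sparingly
    router_is_divergent: Callable[[List[int], int], bool],  # lightweight router
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Generate with the SLM; defer to the LLM only on tokens the router
    predicts would push the reasoning path off course."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        candidate = slm_next_token(ids)           # cheap proposal for this step
        if router_is_divergent(ids, candidate):   # predicted path-divergent token
            candidate = llm_next_token(ids)       # take the LLM's token instead
        ids.append(candidate)
        if candidate == eos_id:
            break
    return ids
```

Because the router flags only a small fraction of positions, most decoding steps stay on the SLM, which is what keeps the average activated parameter count low.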
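The abstract also mentions an automatic pipeline that identifies divergent tokens and produces token-level routing labels for training the router. One plausible way to frame that labeling, sketched under the assumption that a mismatched SLM token counts as divergent only when continuing from it changes the final outcome (rather than being a neutral wording variation), is shown below; `llm_continue` and `same_outcome` are illustrative placeholders, not the paper's exact procedure.

```python
# Hypothetical sketch of token-level divergence labeling for router training.
from typing import Callable, List, Tuple


def label_divergent_tokens(
    reference_ids: List[int],                               # LLM reference trace
    slm_next_token: Callable[[List[int]], int],             # SLM greedy proposal
    llm_continue: Callable[[List[int]], List[int]],         # LLM finishes a prefix
    same_outcome: Callable[[List[int], List[int]], bool],   # verifier / judge
) -> List[Tuple[int, int, bool]]:
    """Return (position, slm_token, is_divergent) labels along the reference."""
    labels = []
    for t in range(1, len(reference_ids)):
        prefix = reference_ids[:t]
        slm_tok = slm_next_token(prefix)
        ref_tok = reference_ids[t]
        if slm_tok == ref_tok:
            labels.append((t, slm_tok, False))   # identical token: keep the SLM
            continue
        # Mismatch: continue the SLM branch with the LLM and compare outcomes.
        slm_branch = prefix + [slm_tok] + llm_continue(prefix + [slm_tok])
        divergent = not same_outcome(slm_branch, reference_ids)
        labels.append((t, slm_tok, divergent))   # True => route this step to LLM
    return labels
```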