R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
May 27, 2025
Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
cs.AI
Abstract
Large Language Models (LLMs) achieve impressive reasoning capabilities at the
cost of substantial inference overhead, posing significant deployment
challenges. Although distilled Small Language Models (SLMs) significantly
enhance efficiency, their performance suffers as they fail to follow LLMs'
reasoning paths. Fortunately, we find that only a small fraction of tokens
genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens
are either identical or exhibit neutral differences, such as minor variations
in abbreviations or expressions. Leveraging this insight, we introduce **Roads
to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs
only for these critical, path-divergent tokens, while leaving the majority of
token generation to the SLM. We also develop an automatic data generation
pipeline that identifies divergent tokens and generates token-level routing
labels to train the lightweight router. We apply R2R to combine R1-1.5B and
R1-32B models from the DeepSeek family, and evaluate on challenging math,
coding, and QA benchmarks. With an average activated parameter size of 5.6B,
R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the
R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with
comparable performance, advancing the Pareto frontier of test-time scaling
efficiency. Our code is available at https://github.com/thu-nics/R2R.
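The abstract describes a per-token routing loop: the SLM proposes each token, and a lightweight router escalates to the LLM only when it predicts the SLM's token would diverge the reasoning path. The following is a minimal illustrative sketch of that control flow, not the authors' implementation; `slm_step`, `llm_step`, and `router` are hypothetical placeholders for the real models and the trained router.

```python
# Illustrative sketch of R2R-style token routing (NOT the official R2R code).
# Assumptions: `slm_step` returns the SLM's next token plus a feature the
# router can score; `llm_step` returns the LLM's next token; `router` maps
# that feature to a divergence score in [0, 1].
from typing import Callable, List, Tuple


def route_generate(
    prompt: List[str],
    slm_step: Callable[[List[str]], Tuple[str, float]],
    llm_step: Callable[[List[str]], str],
    router: Callable[[float], float],
    threshold: float = 0.5,
    max_tokens: int = 32,
) -> List[str]:
    """Generate with the SLM, deferring to the LLM only on tokens the
    router flags as likely path-divergent."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        slm_token, feature = slm_step(tokens)
        if router(feature) > threshold:
            # Predicted divergence: spend the LLM on this critical token.
            token = llm_step(tokens)
        else:
            # Identical or neutral difference: keep the cheap SLM token.
            token = slm_token
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens
```

Because most tokens fall below the threshold, the expensive LLM is invoked only occasionally, which is what keeps the average activated parameter count low while preserving the LLM's reasoning trajectory.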