
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

May 27, 2025
Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
cs.AI

Abstract

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) greatly improve efficiency, their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we reveal that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively invokes the LLM only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and produces token-level routing labels to train a lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family, and evaluate it on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
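To make the routing idea concrete, here is a minimal sketch of the decode loop, not the authors' implementation (see the repository above for that). The interfaces `slm_step`, `llm_step`, and `router`, along with the 0.5 `threshold`, are hypothetical stand-ins: the SLM proposes each token, the lightweight router scores how likely that token is path-divergent (presumably from SLM features, per the token-level labels described in the abstract), and only flagged tokens are regenerated by the LLM.

```python
"""Minimal R2R-style token-routing sketch (illustrative, not the authors' code)."""
from typing import Callable, List, Tuple

def r2r_generate(
    prompt: List[int],
    slm_step: Callable[[List[int]], Tuple[int, List[float]]],  # hypothetical: (token, features)
    llm_step: Callable[[List[int]], int],                      # hypothetical: token
    router: Callable[[List[float]], float],                    # hypothetical: P(divergent)
    threshold: float = 0.5,                                    # assumed cutoff, not from the paper
    max_new_tokens: int = 256,
    eos_id: int = 2,
) -> List[int]:
    ids = list(prompt)
    for _ in range(max_new_tokens):
        # Cheap default path: the SLM proposes the next token.
        token, feats = slm_step(ids)
        # The lightweight router flags likely path-divergent tokens;
        # only those trigger the expensive LLM forward pass.
        if router(feats) > threshold:
            token = llm_step(ids)
        ids.append(token)
        if token == eos_id:
            break
    return ids
```

Because most tokens are identical or only neutrally different, the LLM branch is taken rarely, which is what lets the mixture's average activated parameter count stay near the SLM's while tracking the LLM's reasoning path.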
