R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Abstract

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, which makes them difficult to deploy. Although distilled Small Language Models (SLMs) significantly improve efficiency, their performance suffers because they fail to follow LLMs' reasoning paths. Fortunately, we find that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively uses the LLM only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
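The routing idea can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `route_generate`, `toy_slm`, `toy_llm`, and `toy_router` are hypothetical names, the "models" are scripted functions rather than neural networks, and the router is a hand-written rule standing in for the trained lightweight router (whose inputs and architecture are not specified in this abstract).

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]
# Hypothetical router interface: returns a score in [0, 1] estimating how
# likely the SLM's candidate token is to send the reasoning path off course.
Router = Callable[[List[Token], Token], float]


def route_generate(
    prompt: List[Token],
    slm_next: NextToken,
    llm_next: NextToken,
    router: Router,
    threshold: float = 0.5,
    max_new_tokens: int = 16,
    eos: Token = "<eos>",
) -> Tuple[List[Token], float]:
    """Generate tokens with the SLM, deferring to the LLM only when flagged."""
    tokens = list(prompt)
    llm_calls = 0
    generated = 0
    for _ in range(max_new_tokens):
        candidate = slm_next(tokens)  # cheap proposal from the small model
        if router(tokens, candidate) >= threshold:
            candidate = llm_next(tokens)  # expensive call, divergent tokens only
            llm_calls += 1
        tokens.append(candidate)
        generated += 1
        if candidate == eos:
            break
    llm_fraction = llm_calls / generated if generated else 0.0
    return tokens, llm_fraction


# Toy stand-ins (assumptions for illustration only).
TARGET = ["2", "+", "2", "=", "4", "<eos>"]


def toy_llm(context: List[Token]) -> Token:
    # The "large model" always follows the correct script.
    return TARGET[len(context)]


def toy_slm(context: List[Token]) -> Token:
    # The "small model" agrees everywhere except on the answer token.
    token = TARGET[len(context)]
    return "5" if token == "4" else token


def toy_router(context: List[Token], candidate: Token) -> float:
    # Hand-written rule: treat the token right after "=" as path-divergent.
    return 1.0 if context and context[-1] == "=" else 0.0


tokens, llm_fraction = route_generate([], toy_slm, toy_llm, toy_router)
```

In this toy run the large model is consulted for one of six generated tokens, yet the output matches what the large model alone would produce. How closely a real router approaches that behavior depends on the quality of its training labels.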
