R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Abstract

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, which makes them difficult to deploy. Distilled Small Language Models (SLMs) are far more efficient, but their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we find that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or differ only neutrally, for example through minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that uses the LLM only for these critical, path-divergent tokens and leaves the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family and evaluate it on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x and outperforms even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
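To make the routing idea concrete, the following is a minimal sketch of a token-level routing loop, not the authors' implementation. The names `routed_generate`, `average_activated_params`, `toy_slm`, `toy_llm`, and `threshold_router` are hypothetical. The confidence threshold is only a stand-in for the trained neural router described in the abstract. Counting the SLM on every token plus the LLM on routed tokens is one possible accounting convention, assumed here for illustration.

```python
from typing import Callable, List, Tuple

Token = str
# A "model" is any callable mapping the tokens so far to (next_token, confidence).
Model = Callable[[List[Token]], Tuple[Token, float]]
# A "router" returns True when the large model should produce the current token.
Router = Callable[[List[Token], Token, float], bool]


def routed_generate(
    slm: Model,
    llm: Model,
    router: Router,
    prompt: List[Token],
    max_new_tokens: int,
    eos: Token = "<eos>",
) -> Tuple[List[Token], List[bool]]:
    """Generate tokens with the SLM, deferring to the LLM only when the router says so."""
    tokens = list(prompt)
    used_llm: List[bool] = []
    for _ in range(max_new_tokens):
        # The small model always proposes the next token first.
        proposal, confidence = slm(tokens)
        use_llm = router(tokens, proposal, confidence)
        if use_llm:
            # The large model replaces the proposal for this token only.
            proposal, _ = llm(tokens)
        tokens.append(proposal)
        used_llm.append(use_llm)
        if proposal == eos:
            break
    return tokens[len(prompt):], used_llm


def average_activated_params(
    used_llm: List[bool], slm_params_b: float, llm_params_b: float
) -> float:
    """Mean parameters (in billions) activated per generated token.

    Assumed convention: the SLM runs on every token, the LLM only on routed ones.
    """
    if not used_llm:
        return 0.0
    llm_rate = sum(used_llm) / len(used_llm)
    return slm_params_b + llm_rate * llm_params_b


# Toy stand-ins (hypothetical): the "SLM" is unsure at every fourth position.
def toy_slm(tokens: List[Token]) -> Tuple[Token, float]:
    position = len(tokens)
    return f"s{position}", (0.2 if position % 4 == 0 else 0.9)


def toy_llm(tokens: List[Token]) -> Tuple[Token, float]:
    return f"L{len(tokens)}", 1.0


def threshold_router(tokens: List[Token], proposal: Token, confidence: float) -> bool:
    # Stand-in for a learned router: defer whenever the SLM is not confident.
    return confidence < 0.5


generated, used_llm = routed_generate(toy_slm, toy_llm, threshold_router, ["q"], 8)
print(generated)
print(average_activated_params(used_llm, slm_params_b=1.5, llm_params_b=32.0))
```

In this toy run the large model handles two of eight tokens. The resulting average is specific to the toy routing rate and is not meant to reproduce the figures reported in the abstract.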
