GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

January 8, 2026
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
cs.AI

Abstract

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining whether a reasoning step requires the capacity of a large model or can be handled with the efficiency of a small one. Existing routing strategies rely on either local token probabilities or post-hoc verification, both of which introduce significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when that token's entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, on AIME25, GlimpRouter attains a 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model. These results suggest a simple yet effective mechanism for efficient reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
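
The routing rule described in the abstract (generate one token with the small model, escalate when its entropy is high) is simple enough to sketch. Below is a minimal, illustrative Python sketch under stated assumptions: the small model exposes a probability distribution over its first token for each step, and the helper names token_entropy and route_step as well as the threshold value tau are hypothetical choices for illustration, not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_step(first_token_probs, tau=1.0):
    """Entropy gate: keep the step on the small model unless the entropy
    of its first-token distribution exceeds the threshold tau.
    (tau=1.0 is an illustrative value, not the paper's setting.)"""
    return "large" if token_entropy(first_token_probs) > tau else "small"

# A confident first token (low entropy) stays on the small model;
# an uncertain, near-uniform one (high entropy) is escalated.
print(route_step([0.97, 0.02, 0.01]))        # -> small (entropy ~0.15)
print(route_step([0.25, 0.25, 0.25, 0.25]))  # -> large (entropy ~1.39)
```

In a full pipeline, the small model would generate the rest of any step it keeps, so the only added cost for escalated steps is a single-token forward pass.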