

GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

January 8, 2026
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
cs.AI

Abstract

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model and when the efficiency of a small model suffices. Existing routing strategies rely on either local token probabilities or post-hoc verification, both of which introduce significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
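The routing rule the abstract describes can be sketched in a few lines: take the small model's logits for the first token of a step, compute the Shannon entropy of the resulting distribution, and escalate to the large model when it exceeds a threshold. This is a minimal illustration, not the paper's implementation; the function names and the threshold value are placeholders.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_logits, threshold=1.0):
    """Decide which model completes a reasoning step.

    A confident (low-entropy) first token keeps the step on the small
    model; a high-entropy first token routes it to the large model.
    The threshold here is an illustrative value, not one from the paper.
    """
    return "large" if token_entropy(first_token_logits) > threshold else "small"
```

A sharply peaked first-token distribution (e.g. logits `[5.0, 0.0, 0.0]`) yields low entropy and stays on the small model, while a near-uniform one routes to the large model; in practice the logits would come from the small model's forward pass on the step prefix.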