ConFu: Contemplate the Future for Better Speculative Sampling
March 9, 2026
Authors: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
cs.AI
Abstract
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over successive generation steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
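The draft-then-verify loop that ConFu builds on can be sketched with toy deterministic models. The `target_next` and `draft_next` functions below are illustrative stand-ins (not real LLMs), and greedy verification replaces full rejection sampling; ConFu's contemplate tokens, MoE routing, and anchor-token training are orthogonal to this basic loop and are not modeled here.

```python
# Minimal sketch of draft-then-verify speculative decoding with greedy
# verification, using toy deterministic next-token "models".

def target_next(prefix):
    # Toy target model: next token is (sum of prefix + 1) mod 7.
    return (sum(prefix) + 1) % 7

def draft_next(prefix):
    # Toy draft model: agrees with the target except when the prefix sum
    # is divisible by 5, mimicking occasional drift from the target.
    guess = (sum(prefix) + 1) % 7
    return (guess + 1) % 7 if sum(prefix) % 5 == 0 else guess

def speculative_decode(prefix, num_tokens, k=4):
    """Generate num_tokens tokens, proposing k draft tokens per round."""
    out = list(prefix)
    while len(out) - len(prefix) < num_tokens:
        # 1) Draft proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies: accept the longest prefix of the proposal
        #    that matches the target's own greedy continuation.
        n_accept, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) != t:
                break
            n_accept += 1
            ctx.append(t)
        out.extend(proposal[:n_accept])
        # 3) Emit one target token: the correction on rejection, or a
        #    bonus token when all k proposals were accepted.
        out.append(target_next(out))
    return out[len(prefix):len(prefix) + num_tokens]
```

Because verification accepts only tokens the target would itself pick greedily, the output is identical to plain greedy decoding with the target model; the draft only changes how many target calls are needed per emitted token.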