
ConFu: Contemplate the Future for Better Speculative Sampling

March 9, 2026
作者: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
cs.AI

Abstract

Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
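The draft-then-verify loop underlying speculative decoding can be sketched as follows. This is a minimal toy illustration of the general paradigm described above, not ConFu's method: `draft_model` and `target_model` are hypothetical stand-ins for real networks, and verification here is a simple greedy token-match rather than the probabilistic acceptance rule used in practice.

```python
import random

random.seed(0)

def draft_model(prefix):
    # Toy stand-in for a lightweight draft model: cheap, deterministic guess.
    return (sum(prefix) + len(prefix)) % 10

def target_model(prefix):
    # Toy stand-in for the large target model; agrees with the draft ~80% of the time.
    guess = (sum(prefix) + len(prefix)) % 10
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_step(prefix, k=4):
    """One draft-then-verify round: the draft proposes k tokens; the target
    accepts the longest matching run, then contributes one token of its own."""
    # 1. Draft k candidate tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verify: in practice the target scores all k positions in one
    #    parallel forward pass; here we check them sequentially.
    accepted, ctx = [], list(prefix)
    for t in draft:
        target_t = target_model(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_t)  # first mismatch: keep the target's token
            break
    else:
        accepted.append(target_model(ctx))  # all accepted: target adds a bonus token
    return accepted

out = speculative_step([1, 2, 3])
print(out)  # 1 to k+1 tokens per round, every one consistent with the target
```

Each round thus emits between 1 and k+1 target-consistent tokens for roughly one target forward pass; the speedup grows with the acceptance rate, which is exactly the quantity ConFu aims to raise by letting the draft model condition on future-oriented signals rather than the prefix alone.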