

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

March 18, 2026
作者: Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott
cs.AI

Abstract

Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and achieving throughput gains of up to 15–19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
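The decode loop the abstract describes — sample top-K candidates from mask-token logits, prune low-probability branches of the speculative tree, then verify candidates against the base model for lossless acceptance — can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the mask positions are treated as independent heads, the joint-probability threshold stands in for the paper's unspecified pruning strategy, and the base model's greedy output is stubbed as a plain list.

```python
# Toy sketch of probing-based speculative MTP: top-K candidates per masked
# slot, joint-probability pruning, and lossless prefix verification.
# All names, thresholds, and the independence assumption are illustrative.
import math
from itertools import product

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def build_token_tree(mask_logits, k=2, prune_threshold=0.01):
    """mask_logits: one logit vector per masked future position.
    Returns candidate continuations (token-id tuples) whose joint
    probability under the per-position heads exceeds the threshold."""
    per_pos = []
    for logits in mask_logits:
        probs = softmax(logits)
        topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
        per_pos.append([(i, probs[i]) for i in topk])
    candidates = []
    for path in product(*per_pos):  # enumerate the top-K tree
        joint = 1.0
        for _, p in path:
            joint *= p
        if joint >= prune_threshold:  # lightweight pruning step
            candidates.append(tuple(tok for tok, _ in path))
    return candidates

def verify(candidates, greedy_targets):
    """Lossless verification: accept the longest candidate prefix that
    matches the base model's own tokens (stubbed here as a list)."""
    best = 0
    for cand in candidates:
        n = 0
        for tok, tgt in zip(cand, greedy_targets):
            if tok != tgt:
                break
            n += 1
        best = max(best, n)
    return best
```

With a 4-token toy vocabulary and two masked slots, `build_token_tree` yields up to `k**2` candidates, and `verify` returns how many future tokens were accepted in one model call — the quantity behind the acceptance-length gains the abstract reports.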