

MARS: Enabling Autoregressive Models Multi-Token Generation

April 8, 2026
Authors: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun
cs.AI

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
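The latency-quality knob described above can be illustrated with a small sketch. The abstract does not specify MARS's exact acceptance rule, so the function below is an assumption: the model proposes several tokens per forward pass with per-token confidences, and the serving system accepts the longest prefix whose confidence clears a tunable threshold (the names `accept_prefix` and the mock proposal list are illustrative, not from the paper).

```python
def accept_prefix(proposals, threshold):
    """Accept the longest prefix of (token, confidence) pairs whose
    confidence is >= threshold. The first token is always accepted,
    so at the limit this degrades gracefully to ordinary one-token
    autoregressive decoding (illustrative rule, not the paper's exact one)."""
    accepted = [proposals[0][0]]
    for tok, conf in proposals[1:]:
        if conf < threshold:
            break  # stop at the first low-confidence token
        accepted.append(tok)
    return accepted

# One mocked multi-token proposal step from the model.
step = [("The", 0.99), ("quick", 0.92), ("brown", 0.71), ("fox", 0.95)]

# Raising the threshold accepts fewer tokens per step (favoring quality);
# lowering it accepts more (favoring throughput) -- adjustable at serve
# time without swapping models, as the abstract describes.
print(accept_prefix(step, 0.9))  # ['The', 'quick']
print(accept_prefix(step, 0.6))  # ['The', 'quick', 'brown', 'fox']
```

Because the threshold is a plain scalar read at each decoding step, a serving system can tighten or loosen it in response to request load without restarting, which is the "real-time speed adjustment" the abstract refers to.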