MARS: Enabling Autoregressive Models Multi-Token Generation

April 8, 2026
Authors: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun
cs.AI

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
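The latency-quality knob described at the end of the abstract can be illustrated with a small sketch. This is a hypothetical illustration of confidence-threshold acceptance, not the authors' implementation: the function name `accept_tokens` and the greedy prefix rule are assumptions. The idea is that, given per-position confidences from a single forward pass, the decoder commits the longest prefix of predicted tokens whose confidence stays above a tunable threshold, always committing at least one token (which recovers standard one-token-per-step AR decoding).

```python
# Hypothetical sketch of confidence-thresholded multi-token acceptance.
# Not the paper's implementation: the greedy prefix rule below is an
# assumption made for illustration.

def accept_tokens(tokens, confidences, threshold):
    """Return the prefix of `tokens` to commit in this decoding step.

    tokens:      candidate tokens predicted in one forward pass
    confidences: model confidence for each candidate position
    threshold:   minimum confidence required to accept a position
    """
    accepted = [tokens[0]]  # the first token is always committed (AR fallback)
    for tok, conf in zip(tokens[1:], confidences[1:]):
        if conf < threshold:
            break  # stop at the first low-confidence position
        accepted.append(tok)
    return accepted

# Raising the threshold toward 1.0 approaches one-token-per-step AR
# decoding; lowering it commits more tokens per step for higher throughput.
print(accept_tokens(["The", "cat", "sat"], [0.99, 0.95, 0.40], 0.9))
# -> ['The', 'cat']
```

Because the threshold is just a scalar compared at decode time, a serving system could adjust it per request under load without swapping models, which is the deployment behavior the abstract describes.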