MARS: 自己回帰モデルによるマルチトークン生成の実現

要旨

自己回帰（AR）言語モデルは、連続するトークンが前の文脈から高度に予測可能な場合でも、一度に1トークンずつテキストを生成します。本論文では、命令チューニングされたARモデルが1回のフォワードパスで複数のトークンを予測するように学習する軽量なファインチューニング手法、MARS（Mask AutoRegreSsive）を提案します。MARSはアーキテクチャの変更や追加パラメータを一切必要とせず、元のARモデルと全く同じ方法で呼び出せる単一モデルを生成し、性能劣化もありません。ターゲットモデルとは別にドラフトモデルを維持する speculative decoding や、Medusaのような追加の予測ヘッドを付加するマルチヘッドアプローチとは異なり、MARSは既存の命令データを用いた継続学習のみを必要とします。1フォワードパス当たり1トークンを生成する場合、MARSは6つの標準ベンチマークでARベースラインを匹敵または上回ります。1ステップで複数のトークンを受け入れることを許可した場合、ベースラインレベルの精度を維持しつつ1.5～1.7倍のスループットを達成します。さらに、バッチ推論のためのブロックレベルKVキャッシュ戦略を開発し、Qwen2.5-7BにおいてKVキャッシュ付きARと比較して最大1.71倍の実時間速度向上を実現しました。最後に、MARSは信頼度閾値設定によるリアルタイムの速度調整をサポートします：高負荷時でも、モデルの交換や再起動なしにスループットを動的に向上でき、実用的なレイテンシと品質の調整機能をデプロイに提供します。

English

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.

MARS: 自己回帰モデルによるマルチトークン生成の実現

MARS: Enabling Autoregressive Models Multi-Token Generation

要旨

Support