MARS: 자기회귀 모델의 다중 토큰 생성 활성화

초록

자동회귀(AR) 언어 모델은 이전 맥락을 고려할 때 연속적인 토큰이 매우 예측 가능한 경우에도 한 번에 하나의 토큰씩 텍스트를 생성합니다. 본 논문에서는 지령어 튜닝된 AR 모델이 순전파(forward pass)당 여러 토큰을 예측하도록 가르치는 경량 미세조정(fine-tuning) 방법인 MARS(Mask AutoRegreSsive)를 소개합니다. MARS는 구조적 수정이나 추가 매개변수를 도입하지 않으며, 원본 AR 모델과 동일한 방식으로 호출 가능하고 성능 저하 없이 단일 모델을 생성합니다. 타겟 모델과 별도의 초안(draft) 모델을 유지하는 추측 디코딩(specellative decoding)이나 추가 예측 헤드를 부착하는 Medusa와 같은 다중 헤드 접근법과 달리, MARS는 기존 지령어 데이터에 대한 지속적 훈련만으로 구현됩니다. 순전파당 하나의 토큰을 생성할 때, MARS는 6가지 표준 벤치마크에서 AR 기준선을 유지하거나 능가합니다. 단계당 여러 토큰을 허용할 경우, 기준선 수준의 정확도를 유지하면서 1.5-1.7배의 처리량(throughput)을 달성합니다. 또한 배치 추론을 위한 블록 수준 KV 캐싱 전략을 개발하여 Qwen2.5-7B에서 KV 캐시를 사용한 AR 대비 최대 1.71배의 실제(wall-clock) 속도 향상을 보였습니다. 마지막으로, MARS는 신뢰도 임계값 설정(confidence thresholding)을 통한 실시간 속도 조절을 지원합니다: 높은 요청 부하 시, 서빙 시스템은 모델 교체나 재시작 없이 즉시 처리량을 증가시켜 배치를 위한 실용적인 지연-품질(latency-quality) 조절 기능을 제공합니다.

English

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.

MARS: 자기회귀 모델의 다중 토큰 생성 활성화

MARS: Enabling Autoregressive Models Multi-Token Generation

초록

Support