Set Block Decoding is a Language Model Inference Accelerator

Abstract

Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
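The idea of sampling several, not necessarily consecutive, future tokens per forward pass can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_forward` is a hypothetical stand-in for a model forward pass, and the confidence rule and the `threshold` value are invented and are not the solver used in the paper. The sketch only shows the control flow: each pass predicts every masked position in a block, reveals those that look confident, and always reveals the first masked position so that progress is never slower than NTP.

```python
from typing import Dict, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet


def toy_forward(prefix: List[str],
                block: List[Optional[str]]) -> Dict[int, Tuple[str, float]]:
    """Hypothetical stand-in for ONE forward pass (not a real model).

    Returns a (token, confidence) guess for every masked position in the block.
    The confidence rule is invented: a position is 'easy' if its left
    neighbour is already known or if its absolute index is a multiple of 3.
    """
    preds = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            pos = len(prefix) + i
            left_known = (i == 0) or (block[i - 1] is not MASK)
            conf = 0.95 if (left_known or pos % 3 == 0) else 0.4
            preds[i] = (f"t{pos}", conf)
    return preds


def decode_block(prefix: List[str], block_size: int,
                 threshold: float = 0.9) -> Tuple[List[str], int]:
    """Fill a block of masked positions, counting forward passes."""
    block: List[Optional[str]] = [MASK] * block_size
    passes = 0
    while any(t is MASK for t in block):
        preds = toy_forward(prefix, block)  # one pass scores all masked slots
        passes += 1
        first = min(preds)  # leftmost masked slot, always revealed
        for i, (tok, conf) in preds.items():
            if conf >= threshold or i == first:
                block[i] = tok
    return block, passes


if __name__ == "__main__":
    tokens, passes = decode_block(["t0", "t1"], block_size=8)
    print(tokens, passes)
```

In this toy setting the first pass reveals positions 2, 3, 6 and 9 together, which are not contiguous, and the whole block of 8 tokens is filled in 3 passes instead of the 8 that one-token-at-a-time decoding would need. The numbers come from the invented confidence rule and say nothing about the speedups reported above.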
