Planned Diffusion

Abstract

A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match that quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages. First, the model creates a short autoregressive plan that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency: a 1.27x to 1.81x speedup over autoregressive generation with only a 0.87% to 5.4% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism is minimal and reliable, and that simple runtime knobs provide flexible control of the quality-latency trade-off.
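The two-stage control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `Span`, `make_plan`, `denoise_span`, and `planned_diffusion` are hypothetical names, the "plan" is a trivial string split standing in for a short autoregressive decode, and the "denoiser" simply unmasks placeholder tokens over a fixed number of steps. The `steps` parameter is an assumed example of a runtime knob; the abstract does not say which knobs the method exposes. Threads stand in for what a real model would likely do as a single batched pass over all spans.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

MASK = "<mask>"  # placeholder for a not-yet-generated token


@dataclass
class Span:
    topic: str   # what this span should cover (hypothetical plan field)
    length: int  # number of token slots reserved for the span


def make_plan(prompt: str, span_length: int = 4) -> list[Span]:
    """Stage 1 stand-in: split the prompt into independent spans.

    A real system would decode a short plan autoregressively; here we
    just split on " and " to obtain independent sub-topics.
    """
    topics = [t.strip() for t in prompt.split(" and ")]
    return [Span(topic=t, length=span_length) for t in topics]


def denoise_span(span: Span, steps: int) -> list[str]:
    """Stage 2 stand-in: fill a fully masked span over `steps` iterations."""
    tokens = [MASK] * span.length
    per_step = -(-span.length // steps)  # ceiling division
    for step in range(steps):
        start = step * per_step
        # Each iteration unmasks the next block of positions in this span.
        for i in range(start, min(start + per_step, span.length)):
            tokens[i] = f"{span.topic}_{i}"
    return tokens


def planned_diffusion(prompt: str, steps: int = 2) -> list[str]:
    """Plan sequentially, then generate every span concurrently."""
    plan = make_plan(prompt)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the plan in its results.
        spans = list(pool.map(lambda s: denoise_span(s, steps), plan))
    return [tok for span in spans for tok in span]


if __name__ == "__main__":
    print(planned_diffusion("cats and dogs", steps=2))
```

In this sketch the spans never read each other's tokens, which mirrors the independence that the plan is meant to establish; fewer `steps` means fewer iterations per span, which in a real model would trade quality for latency.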
