Towards Optimal Multi-draft Speculative Decoding

Abstract

Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach in which, for each generated token, a small draft model proposes multiple drafts and the target LLM verifies them in parallel, ensuring that the final output follows the target model's distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is the solution of an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and to measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, which provides a way to compute the optimal acceptance rate efficiently. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods by their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms fall short of the theoretical upper bound under both sampling without replacement and sampling with replacement. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.
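To make the notions of verification and acceptance rate concrete, the following is a minimal sketch of recursive rejection sampling (RRS), one commonly cited verification scheme for drafts sampled with replacement. The abstract does not name the algorithms it evaluates, so the choice of RRS is an assumption made for illustration. The function names `rrs_verify` and `rrs_acceptance_rate` and the toy distributions `p` (target) and `q` (draft) are hypothetical. The sketch does not compute the optimal acceptance rate from the dual problem; it only gives the rate achieved by this one verifier.

```python
import numpy as np


def rrs_verify(p, q, drafts, rng):
    """Verify drafts drawn i.i.d. from q against target p (illustrative sketch).

    Returns (token, accepted). The returned token is distributed according to p.
    """
    q = np.asarray(q, dtype=float)
    residual = np.asarray(p, dtype=float).copy()
    for x in drafts:
        # Accept draft x with probability min(1, residual[x] / q[x]).
        if rng.random() * q[x] < residual[x]:
            return int(x), True
        # On rejection, move to the normalized residual max(residual - q, 0).
        leftover = np.maximum(residual - q, 0.0)
        residual = leftover / leftover.sum()
    # Every draft was rejected: sample from the final residual distribution.
    return int(rng.choice(len(residual), p=residual)), False


def rrs_acceptance_rate(p, q, n_drafts):
    """Exact probability that RRS accepts at least one of n_drafts drafts."""
    q = np.asarray(q, dtype=float)
    residual = np.asarray(p, dtype=float).copy()
    reject_all = 1.0
    for _ in range(n_drafts):
        # Given that all earlier drafts were rejected, the next draft is
        # accepted with probability sum_x min(residual[x], q[x]).
        reject_all *= 1.0 - np.minimum(residual, q).sum()
        leftover = np.maximum(residual - q, 0.0)
        total = leftover.sum()
        if total <= 0.0:  # residual equals q, so nothing is ever rejected
            break
        residual = leftover / total
    return 1.0 - reject_all


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution (assumed)
    q = [0.2, 0.3, 0.5]  # toy draft distribution (assumed)
    for n in (1, 2, 3):
        print(n, rrs_acceptance_rate(p, q, n))
```

With a single draft the rate reduces to the sum over tokens of min(p, q), which for these toy distributions is about 0.70; a second draft raises it to about 0.76. Comparing such per-algorithm rates against the optimal transport optimum is the kind of gap the abstract refers to.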
