Cascade Speculative Drafting for Even Faster LLM Inference

Abstract

Speculative decoding improves the efficiency of large language models (LLMs) by using a draft model to propose tokens that a larger target model then reviews. However, drafting in speculative decoding has two inefficiencies: the draft model itself generates slowly and autoregressively, and it spends the same amount of time on every token regardless of that token's importance. Together these lead to suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade allocates drafting time efficiently, with its optimality supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm achieved up to 72 percent additional speedup over speculative decoding in our experiments while keeping the same output distribution.
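For context, the draft-then-review step of standard speculative decoding (the baseline mentioned above, not the CS. Drafting algorithm itself) can be sketched roughly as follows. The function name `verify_draft` and the dictionary-based token distributions are illustrative assumptions, not taken from the paper; the accept/resample rule is the usual one that keeps the output distribution identical to the target model's.

```python
import random

def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Review drafted tokens against the target model (illustrative sketch).

    draft_tokens: tokens proposed by the draft model
    q_dists[i]:   draft distribution (dict token -> prob) that produced draft_tokens[i]
    p_dists[i]:   target distribution at the same position; p_dists has one
                  extra entry, used when every drafted token is accepted
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the renormalized residual max(0, p - q)
        # and stop reviewing the remaining drafted tokens.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens, weights = zip(*residual.items())
        accepted.append(rng.choices(tokens, weights=weights)[0])
        return accepted
    # Every drafted token was accepted: sample one more token from the target.
    tokens, weights = zip(*p_dists[len(draft_tokens)].items())
    accepted.append(rng.choices(tokens, weights=weights)[0])
    return accepted
```

In this sketch the cost the abstract refers to sits outside the function: producing `draft_tokens` requires one draft-model call per token, which is the autoregressive drafting overhead that the two cascades aim to reduce.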
