POSS: Position Specialist Generates Better Draft for Speculative Decoding

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from degrading quality of draft token predictions at later positions, because errors accumulate in the features generated by the draft model. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers that generate tokens at their assigned position(s). Position specialists greatly improve the token acceptance rate at later positions in each drafting round, as each specialist only needs to handle a certain level of draft model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines in average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
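
As a rough illustration of the draft-then-verify loop described above, the following toy Python sketch assigns one draft layer per draft position and measures how many tokens a drafting round yields. It is not taken from the PosS codebase: every name (target_model, make_draft_layer, draft_round, verify) is hypothetical, the "models" are simple arithmetic stand-ins, and verification is greedy and sequential here, whereas a real target model checks all drafted tokens in one parallel forward pass.

```python
from typing import Callable, List, Sequence

Token = int
Model = Callable[[Sequence[Token]], Token]


def target_model(context: Sequence[Token]) -> Token:
    """Toy stand-in for the large target model (assumption, not a real LLM)."""
    return sum(context) % 10


def make_draft_layer(error_every: int) -> Model:
    """Hypothetical draft layer: agrees with the target model except when the
    context length is a multiple of `error_every`."""
    def draft_layer(context: Sequence[Token]) -> Token:
        correct = target_model(context)
        if len(context) % error_every == 0:
            return (correct + 1) % 10  # deliberately wrong token
        return correct
    return draft_layer


def draft_round(context: Sequence[Token], specialists: List[Model], depth: int) -> List[Token]:
    """Draft `depth` tokens, using the specialist assigned to each position.
    Positions beyond the last specialist reuse the last one."""
    drafted: List[Token] = []
    for pos in range(depth):
        specialist = specialists[min(pos, len(specialists) - 1)]
        drafted.append(specialist(list(context) + drafted))
    return drafted


def verify(context: Sequence[Token], drafted: Sequence[Token]) -> List[Token]:
    """Greedy verification: keep the longest prefix the target agrees with,
    then append one token from the target itself (correction or bonus)."""
    accepted: List[Token] = []
    for token in drafted:
        expected = target_model(list(context) + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction ends the round
            return accepted
        accepted.append(token)
    accepted.append(target_model(list(context) + accepted))  # bonus token
    return accepted


context = [1, 2, 3]
accurate = [make_draft_layer(100)] * 3
weak_last = [make_draft_layer(100), make_draft_layer(100), make_draft_layer(5)]

full = verify(context, draft_round(context, accurate, depth=3))
short = verify(context, draft_round(context, weak_last, depth=3))
print(len(full), len(short))  # acceptance lengths for the two rounds
```

In this toy setting, a draft layer that errs at the third position cuts the round short by one token, which is the kind of later-position degradation that position-specialized draft layers are meant to reduce.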
