AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Abstract

Speculative Decoding (SD) accelerates large language model inference by having a small draft model generate predictions that a larger target model then verifies. The effectiveness of SD hinges on how well the two models are aligned, and this alignment is typically improved through Knowledge Distillation (KD). However, conventional KD methods minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize the token acceptance rate. Because draft models have limited capacity, they struggle to fully assimilate the target model's knowledge, which leads to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC uses a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that aligns better with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
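The contrast between all-token KD and selective KD can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the difficulty score (KL divergence between the target and the reference model at each position), and the `keep_fraction` parameter are assumptions made here for illustration only, and the actual selection criterion in AdaSPEC may differ. The `acceptance_rate` helper uses the standard speculative sampling result that a draft token is accepted with expected probability equal to the sum over the vocabulary of min(p, q).

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def acceptance_rate(p_target, q_draft):
    """Expected acceptance probability of one draft token under standard
    speculative sampling: sum over the vocabulary of min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p_target, q_draft))

def selective_kd_loss(target, draft, reference, keep_fraction=0.5):
    """Toy selective distillation loss (illustrative assumption, not the
    paper's exact rule).

    Each argument is a list of per-position next-token distributions.
    Positions are scored by KL(target || reference); the hardest ones are
    dropped, and the draft's KL to the target is averaged over the rest.
    """
    difficulty = [kl_divergence(t, r) for t, r in zip(target, reference)]
    n_keep = max(1, int(round(keep_fraction * len(target))))
    # Keep the n_keep positions that the reference model fits best.
    easiest_first = sorted(range(len(target)), key=lambda i: difficulty[i])
    kept = sorted(easiest_first[:n_keep])
    loss = sum(kl_divergence(target[i], draft[i]) for i in kept) / len(kept)
    return loss, kept

# Hypothetical two-word vocabulary and four token positions.
target = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
reference = [[0.85, 0.15], [0.9, 0.1], [0.25, 0.75], [0.1, 0.9]]
draft = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.3, 0.7]]

selective_loss, kept_positions = selective_kd_loss(target, draft, reference, 0.5)
full_loss, _ = selective_kd_loss(target, draft, reference, 1.0)
rate_first_token = acceptance_rate(target[0], draft[0])
```

In this toy data the reference model fits positions 0 and 2 well and positions 1 and 3 poorly, so only the former contribute to the selective loss. With `keep_fraction=1.0` the function reduces to ordinary all-token distillation, which makes it easy to compare the two objectives on the same inputs.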
