Speculative Streaming: Fast LLM Inference without Auxiliary Models

Abstract

Speculative decoding is a prominent technique for speeding up inference of a large target language model using predictions from an auxiliary draft model. While effective, in application-specific settings it often requires fine-tuning both the draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next-token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8x to 3.1x across a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient: it achieves speed-ups comparable to or higher than Medusa-style architectures while using roughly 10,000x fewer extra parameters, making it well suited to resource-constrained devices.
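
To make the draft-and-verify idea concrete, the following is a minimal sketch of the greedy verification step common to speculative decoding in general. It is not the Speculative Streaming implementation. The names `verify_draft` and `toy_target` are hypothetical, the drafted tokens are simply taken as given (in Speculative Streaming they would come from the target model itself rather than from an auxiliary model), and the target is queried one position at a time for clarity, whereas a practical system would score all drafted positions in a single parallel forward pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens committed in one greedy draft-and-verify step.

    `target_next` is a hypothetical stand-in for the target model's greedy
    next-token choice given a context.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: commit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Every drafted token matched, so the target adds one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    # Toy assumption: the "model" always predicts (last token + 1) mod 10.
    return (context[-1] + 1) % 10


full_match = verify_draft([1, 2], [3, 4, 5], toy_target)
early_reject = verify_draft([1, 2], [3, 9, 5], toy_target)
print(full_match, early_reject)
```

Each step commits at least one token, so the output matches plain greedy decoding with the target model. The speed-up comes from steps in which several drafted tokens are accepted at once, which is why the acceptance rate matters.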
