BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Abstract

Large language models (LLMs) commonly employ autoregressive generation during inference, which leads to high memory bandwidth demand and, consequently, extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method that expedites LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning, which gives them the capability for semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. With the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
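The claim that draft verification yields outputs identical to autoregressive generation under greedy sampling can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a linear draft rather than BiTA's tree-based decoding, and every name here (`toy_next_token`, `toy_draft`, `verify_draft`, `draft_and_verify_decode`) is a hypothetical stand-in. The toy "model" is a deterministic function of the last token, and the drafter is deliberately imperfect so that both acceptance and rejection occur.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
Drafter = Callable[[List[int]], List[int]]


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for greedy (argmax) decoding with a real LLM.
    return (context[-1] * 3 + 1) % 7


def toy_draft(context: List[int], k: int = 3) -> List[int]:
    # Hypothetical drafter: imitates the toy model, but guesses 4
    # wherever the true token would be 5, so some drafts get rejected.
    ctx = list(context)
    draft = []
    for _ in range(k):
        guess = toy_next_token(ctx)
        if guess == 5:
            guess = 4
        draft.append(guess)
        ctx.append(guess)
    return draft


def autoregressive_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one token per step.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def verify_draft(next_token: NextToken, context: List[int], draft: List[int]) -> List[int]:
    # Accept the longest draft prefix that matches the greedy choices, then
    # append one token from the model itself (a correction on mismatch, or a
    # bonus token if the whole draft matched). A real system would perform
    # these checks in a single parallel forward pass; the loop only simulates it.
    ctx = list(context)
    accepted = []
    for tok in draft:
        target = next_token(ctx)
        if tok != target:
            accepted.append(target)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token(ctx))
    return accepted


def draft_and_verify_decode(
    next_token: NextToken, propose_draft: Drafter, prompt: List[int], n_new: int
) -> Tuple[List[int], int]:
    # Each step emits at least one verified token, so the loop terminates.
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < n_new:
        draft = propose_draft(tokens)
        tokens.extend(verify_draft(next_token, tokens, draft))
        steps += 1
    return tokens[len(prompt):len(prompt) + n_new], steps


prompt = [2]
reference = autoregressive_decode(toy_next_token, prompt, 12)
fast, steps = draft_and_verify_decode(toy_next_token, toy_draft, prompt, 12)
assert fast == reference  # lossless under greedy decoding
assert steps < 12         # fewer verification steps than tokens generated
```

Because every emitted token is either a draft token that matched the greedy choice or the greedy choice itself, the result cannot differ from plain autoregressive decoding. Any speedup depends only on how often the drafts are accepted.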
