BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Abstract

Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), a method that accelerates LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning, which gives them the capability of semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
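
The losslessness claim (outputs identical to autoregressive decoding under greedy sampling) can be illustrated with a minimal sketch. This is not the BiTA implementation: it uses a single linear draft rather than tree-based decoding, and `toy_next_token`, `noisy_draft`, `verify_draft`, and `draft_and_verify_decode` are hypothetical names introduced only for illustration. A real system would score all draft positions in one forward pass; the loop below only simulates that check.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def toy_next_token(prefix: Sequence[Token]) -> Token:
    """Hypothetical stand-in for an LLM's greedy (argmax) next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def greedy_decode(next_token: NextTokenFn, prompt: Sequence[Token], n: int) -> List[Token]:
    """Plain autoregressive greedy decoding: one token per model call."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def noisy_draft(prefix: Sequence[Token], k: int = 3) -> List[Token]:
    """Hypothetical drafter: k guesses, the last one deliberately wrong."""
    context, draft = list(prefix), []
    for i in range(k):
        tok = toy_next_token(context)
        if i == k - 1:
            tok = (tok + 1) % 50  # inject an error so verification has work to do
        draft.append(tok)
        context.append(tok)
    return draft


def verify_draft(next_token: NextTokenFn, prefix: Sequence[Token],
                 draft: Sequence[Token]) -> List[Token]:
    """Keep the longest draft prefix that matches greedy decoding, then add
    the greedy token at the first mismatch (or one extra token if all match)."""
    context, accepted = list(prefix), []
    for tok in draft:
        target = next_token(context)
        if tok != target:
            accepted.append(target)  # correct the first wrong draft token
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(next_token(context))  # whole draft accepted: one extra token
    return accepted


def draft_and_verify_decode(next_token: NextTokenFn,
                            draft_fn: Callable[[Sequence[Token]], List[Token]],
                            prompt: Sequence[Token], n: int) -> List[Token]:
    """Generate n tokens, several per verification step."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.extend(verify_draft(next_token, out, draft_fn(out)))
    return out[len(prompt):len(prompt) + n]


prompt = [1, 2, 3]
fast = draft_and_verify_decode(toy_next_token, noisy_draft, prompt, 10)
slow = greedy_decode(toy_next_token, prompt, 10)
print(fast == slow)
```

Because every accepted token is checked against the greedy choice, draft quality in this sketch affects only how many tokens are produced per verification step, not the output itself.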
