BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
January 23, 2024
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
cs.AI
Abstract
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended
latency. To mitigate this inefficiency, we present Bi-directional Tuning for
lossless Acceleration (BiTA), an innovative method expediting LLMs via
streamlined semi-autoregressive generation and draft verification. Inspired by
the concept of prompt tuning, we enhance LLMs with a parameter-efficient design
called bi-directional tuning to enable semi-autoregressive
generation. Employing efficient tree-based decoding, the models perform draft
candidate generation and verification in parallel, ensuring outputs identical
to their autoregressive counterparts under greedy sampling. BiTA serves as a
lightweight plug-in module, seamlessly boosting the inference efficiency of
existing LLMs without requiring additional assistance models or incurring
significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat
achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments
confirm our method surpasses state-of-the-art acceleration techniques.
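For illustration, below is a minimal sketch of the draft-and-verify loop that underlies this kind of lossless acceleration under greedy sampling. The helpers `draft_fn` and `verify_fn` are hypothetical stand-ins rather than BiTA's actual API: in the proposed method, the tuned model generates draft candidates and verifies them jointly in a single tree-based forward pass, whereas the sketch separates the two steps for clarity.

```python
# Sketch of one draft-and-verify decoding step with greedy acceptance.
# `draft_fn(tokens)` proposes k speculative next tokens; `verify_fn(tokens, draft)`
# returns the base model's greedy token at each draft position plus one extra
# (k + 1 values), which a real system would compute in a single parallel pass.
from typing import Callable, List


def draft_and_verify_step(
    tokens: List[int],
    draft_fn: Callable[[List[int]], List[int]],
    verify_fn: Callable[[List[int], List[int]], List[int]],
) -> List[int]:
    """Accept the longest draft prefix that matches greedy decoding, then one base token."""
    draft = draft_fn(tokens)            # k speculative tokens
    greedy = verify_fn(tokens, draft)   # base model's greedy choices (k + 1 tokens)

    accepted: List[int] = []
    for proposed, target in zip(draft, greedy):
        if proposed == target:
            accepted.append(proposed)   # draft agrees with the base model: keep it
        else:
            break                       # first mismatch: stop accepting drafts
    # Append the base model's own token at the next position, guaranteeing at least
    # one new token per step and output identical to plain greedy autoregression.
    accepted.append(greedy[len(accepted)])
    return tokens + accepted


# Toy usage with hard-coded stand-ins (no real model involved):
if __name__ == "__main__":
    fake_draft = lambda ctx: [7, 8, 9]         # always proposes three tokens
    fake_verify = lambda ctx, d: [7, 8, 4, 5]  # base model agrees on the first two only
    print(draft_and_verify_step([1, 2, 3], fake_draft, fake_verify))
    # -> [1, 2, 3, 7, 8, 4]: two draft tokens accepted, then the base model's token
```

Because every accepted token is checked against the base model's greedy prediction, the loop emits exactly the sequence greedy autoregressive decoding would produce, while advancing several tokens per forward pass whenever the drafts are correct.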