BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
January 23, 2024
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
cs.AI
Abstract
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended
latency. To mitigate this inefficiency, we present Bi-directional Tuning for
lossless Acceleration (BiTA), an innovative method expediting LLMs via
streamlined semi-autoregressive generation and draft verification. Inspired by
the concept of prompt tuning, we enhance LLMs with a parameter-efficient design
called bi-directional tuning to enable semi-autoregressive
generation. Employing efficient tree-based decoding, the models perform draft
candidate generation and verification in parallel, ensuring outputs identical
to their autoregressive counterparts under greedy sampling. BiTA serves as a
lightweight plug-in module, seamlessly boosting the inference efficiency of
existing LLMs without requiring additional assistance models or incurring
significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat
achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments
confirm our method surpasses state-of-the-art acceleration techniques.
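For illustration, below is a minimal sketch of the draft-and-verify loop that underlies this kind of lossless acceleration under greedy sampling. The helpers `draft_fn` and `verify_fn` are hypothetical stand-ins rather than BiTA's actual API: in the proposed method, the tuned model generates draft candidates and verifies them jointly in a single tree-based forward pass, whereas the sketch separates the two steps for clarity.

```python
# Sketch of one draft-and-verify decoding step with greedy acceptance.
# `draft_fn(tokens)` proposes k speculative next tokens; `verify_fn(tokens, draft)`
# returns the base model's greedy token at each draft position plus one extra
# (k + 1 values), which a real system would compute in a single parallel pass.
from typing import Callable, List


def draft_and_verify_step(
    tokens: List[int],
    draft_fn: Callable[[List[int]], List[int]],
    verify_fn: Callable[[List[int], List[int]], List[int]],
) -> List[int]:
    """Accept the longest draft prefix that matches greedy decoding, then one base token."""
    draft = draft_fn(tokens)            # k speculative tokens
    greedy = verify_fn(tokens, draft)   # base model's greedy choices (k + 1 tokens)

    accepted: List[int] = []
    for proposed, target in zip(draft, greedy):
        if proposed == target:
            accepted.append(proposed)   # draft agrees with the base model: keep it
        else:
            break                       # first mismatch: stop accepting drafts
    # Append the base model's own token at the next position, guaranteeing at least
    # one new token per step and output identical to plain greedy autoregression.
    accepted.append(greedy[len(accepted)])
    return tokens + accepted


# Toy usage with hard-coded stand-ins (no real model involved):
if __name__ == "__main__":
    fake_draft = lambda ctx: [7, 8, 9]         # always proposes three tokens
    fake_verify = lambda ctx, d: [7, 8, 4, 5]  # base model agrees on the first two only
    print(draft_and_verify_step([1, 2, 3], fake_draft, fake_verify))
    # -> [1, 2, 3, 7, 8, 4]: two draft tokens accepted, then the base model's token
```

Because every accepted token is checked against the base model's greedy prediction, the loop emits exactly the sequence greedy autoregressive decoding would produce, while advancing several tokens per forward pass whenever the drafts are correct.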