BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
January 23, 2024
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
cs.AI
Abstract
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended
latency. To mitigate this inefficiency, we present Bi-directional Tuning for
lossless Acceleration (BiTA), an innovative method expediting LLMs via
streamlined semi-autoregressive generation and draft verification. Inspired by
the concept of prompt tuning, we enhance LLMs with a parameter-efficient design
called bi-directional tuning for the capability in semi-autoregressive
generation. Employing efficient tree-based decoding, the models perform draft
candidate generation and verification in parallel, ensuring outputs identical
to their autoregressive counterparts under greedy sampling. BiTA serves as a
lightweight plug-in module, seamlessly boosting the inference efficiency of
existing LLMs without requiring additional assistance models or incurring
significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat
achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments
confirm our method surpasses state-of-the-art acceleration techniques.
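The lossless-acceleration claim rests on the verification step: the base model scores a whole block of drafted tokens in one forward pass and keeps only the prefix that matches its own greedy choices, so the final output equals ordinary token-by-token greedy decoding. The sketch below illustrates that verification logic in PyTorch under assumed names (`model_logits`, `input_ids`, `draft_ids` are placeholders, not the authors' code); BiTA's bi-directional prompt tuning and its tree-structured batching of multiple candidate branches are omitted.

```python
import torch


def greedy_verify(model_logits, input_ids, draft_ids):
    """Verify one block of draft tokens with a single forward pass.

    model_logits(ids) is assumed to return next-token logits of shape
    (1, len(ids), vocab). input_ids is the (1, P) prompt so far and
    draft_ids is a (1, D) block of proposed tokens. Returns the tokens to
    append: the accepted draft prefix plus one token chosen by the base
    model itself, matching step-by-step greedy decoding exactly.
    """
    prompt_len = input_ids.shape[-1]
    candidate = torch.cat([input_ids, draft_ids], dim=-1)
    logits = model_logits(candidate)

    # Base model's greedy choice at every draft position, all at once.
    preds = logits[:, prompt_len - 1:-1, :].argmax(dim=-1)      # (1, D)

    # Longest prefix of the draft that agrees with those choices.
    matches = (preds == draft_ids).long().cumprod(dim=-1)
    n_accept = int(matches.sum())

    if n_accept == draft_ids.shape[-1]:
        # Entire draft accepted; the last logits yield one bonus token.
        bonus = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    else:
        # First mismatch: substitute the model's own greedy token there.
        bonus = preds[:, n_accept:n_accept + 1]

    return torch.cat([draft_ids[:, :n_accept], bonus], dim=-1)
```

In the full method, several draft branches are packed into a single tree-structured batch with an appropriate attention mask, so candidate generation and verification happen in the same forward pass, as described in the abstract.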