

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

January 23, 2024
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
cs.AI

Abstract

Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.
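
To make the draft-and-verify idea in the abstract concrete, below is a minimal, self-contained Python sketch of the general principle with greedy acceptance. It is not BiTA's actual implementation: `base_greedy_next`, `propose_drafts`, `gamma`, and `eos_id` are hypothetical stand-ins, and the verification loop is written serially for clarity, whereas BiTA scores tree-structured candidates in a single parallel pass.

```python
from typing import Callable, List


def draft_and_verify_decode(
    prompt: List[int],
    base_greedy_next: Callable[[List[int]], int],
    propose_drafts: Callable[[List[int], int], List[int]],
    gamma: int = 4,
    max_new_tokens: int = 64,
    eos_id: int = 2,
) -> List[int]:
    """Greedy draft-and-verify decoding.

    Accepts the longest draft prefix that matches the base model's greedy
    choices, so the result is identical to plain greedy autoregressive
    decoding of the base model (the "lossless" property in the abstract).
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # Cheap speculative proposal of up to `gamma` future tokens.
        drafts = propose_drafts(tokens, gamma)
        mismatch_token = None
        for d in drafts:
            # Verified one position at a time here for clarity; BiTA instead
            # checks all candidates in parallel via tree-based decoding.
            target = base_greedy_next(tokens)
            if d != target:
                mismatch_token = target  # base model disagrees: keep its token
                break
            tokens.append(target)
            produced += 1
            if target == eos_id or produced >= max_new_tokens:
                return tokens
        if mismatch_token is not None:
            tokens.append(mismatch_token)
            produced += 1
            if mismatch_token == eos_id:
                return tokens
        elif not drafts:
            # Degenerate case: no drafts proposed, fall back to one greedy step.
            tokens.append(base_greedy_next(tokens))
            produced += 1
            if tokens[-1] == eos_id:
                return tokens
    return tokens
```

Because acceptance requires an exact match with the base model's greedy prediction at every position, the accelerated output can never diverge from ordinary greedy decoding; the speedup comes only from accepting several draft tokens per verification round.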