BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
January 23, 2024
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
cs.AI
Abstract
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended
latency. To mitigate this inefficiency, we present Bi-directional Tuning for
lossless Acceleration (BiTA), an innovative method expediting LLMs via
streamlined semi-autoregressive generation and draft verification. Inspired by
the concept of prompt tuning, we enhance LLMs with a parameter-efficient design
called bi-directional tuning for the capability in semi-autoregressive
generation. Employing efficient tree-based decoding, the models perform draft
candidate generation and verification in parallel, ensuring outputs identical
to their autoregressive counterparts under greedy sampling. BiTA serves as a
lightweight plug-in module, seamlessly boosting the inference efficiency of
existing LLMs without requiring additional assistance models or incurring
significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat
achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments
confirm our method surpasses state-of-the-art acceleration techniques.
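The lossless-acceleration claim rests on the verification step: the base model scores a whole block of drafted tokens in one forward pass and keeps only the prefix that matches its own greedy choices, so the final output equals ordinary token-by-token greedy decoding. The sketch below illustrates that verification logic in PyTorch under assumed names (`model_logits`, `input_ids`, `draft_ids` are placeholders, not the authors' code); BiTA's bi-directional prompt tuning and its tree-structured batching of multiple candidate branches are omitted.

```python
import torch


def greedy_verify(model_logits, input_ids, draft_ids):
    """Verify one block of draft tokens with a single forward pass.

    model_logits(ids) is assumed to return next-token logits of shape
    (1, len(ids), vocab). input_ids is the (1, P) prompt so far and
    draft_ids is a (1, D) block of proposed tokens. Returns the tokens to
    append: the accepted draft prefix plus one token chosen by the base
    model itself, matching step-by-step greedy decoding exactly.
    """
    prompt_len = input_ids.shape[-1]
    candidate = torch.cat([input_ids, draft_ids], dim=-1)
    logits = model_logits(candidate)

    # Base model's greedy choice at every draft position, all at once.
    preds = logits[:, prompt_len - 1:-1, :].argmax(dim=-1)      # (1, D)

    # Longest prefix of the draft that agrees with those choices.
    matches = (preds == draft_ids).long().cumprod(dim=-1)
    n_accept = int(matches.sum())

    if n_accept == draft_ids.shape[-1]:
        # Entire draft accepted; the last logits yield one bonus token.
        bonus = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    else:
        # First mismatch: substitute the model's own greedy token there.
        bonus = preds[:, n_accept:n_accept + 1]

    return torch.cat([draft_ids[:, :n_accept], bonus], dim=-1)
```

In the full method, several draft branches are packed into a single tree-structured batch with an appropriate attention mask, so candidate generation and verification happen in the same forward pass, as described in the abstract.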