Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
November 14, 2023
Authors: Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
cs.AI
Abstract
In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding that requires no further training of an auxiliary model and no modification to the LLM itself. FastCoT uses a size-varying context window, whose size changes with position, to conduct parallel decoding and autoregressive decoding simultaneously, thereby fully utilizing GPU computation resources. In FastCoT, the parallel-decoding part gives the LLM a quick glance at the future, composed of approximate tokens, which can lead to answers faster than the regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within the LLM that supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT saves nearly 20% of inference time with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size exhibits considerable robustness across different tasks.
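To make the mechanism concrete, below is a minimal, hypothetical sketch of the core idea the abstract describes: a single forward pass that both decodes the next exact token autoregressively and refreshes a window of approximate future tokens via parallel (Jacobi-style) decoding. This is not the authors' implementation; it assumes a HuggingFace-style causal LM interface, uses a fixed-size lookahead window for simplicity (the paper's window size varies with position), and the names fastcot_step, exact_tokens, and approx_tokens are illustrative.

```python
import torch

def fastcot_step(model, exact_tokens, approx_tokens):
    """One decoding step: exact prefix + approximate lookahead in a single pass.

    exact_tokens:  1-D LongTensor, tokens decoded so far (the verified prefix).
    approx_tokens: 1-D LongTensor, current guesses for the next K future tokens.
    """
    # Concatenate the verified prefix with the approximate lookahead window so
    # one forward pass scores every position, filling otherwise idle GPU slots.
    input_ids = torch.cat([exact_tokens, approx_tokens]).unsqueeze(0)
    logits = model(input_ids).logits[0]  # (seq_len, vocab)

    prefix_len = exact_tokens.numel()
    k = approx_tokens.numel()

    # Autoregressive part: the last prefix position predicts the next exact token.
    next_token = logits[prefix_len - 1].argmax(-1, keepdim=True)

    # Parallel part: all K lookahead positions are re-predicted at once; repeated
    # steps let these approximate tokens converge toward the exact continuation.
    # (Position i predicts token i + 1, so the slice starts at prefix_len - 1;
    # its first entry coincides with next_token, which becomes exact.)
    new_approx = logits[prefix_len - 1 : prefix_len - 1 + k].argmax(-1)

    return torch.cat([exact_tokens, next_token]), new_approx
```

In use, such a step would be called in a loop: the exact prefix grows by one verified token per pass while the approximate window is refreshed in parallel, so the model always has a rough preview of the upcoming chain-of-thought and answer extraction can be attempted before the full reasoning chain is decoded exactly.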