Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
November 14, 2023
Authors: Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
cs.AI
Abstract
In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding that requires no further training of an auxiliary model and no modification to the LLM itself. FastCoT uses a size-varying context window, whose size changes with position, to conduct parallel decoding and autoregressive decoding simultaneously, thereby fully utilizing GPU computation resources. In FastCoT, the parallel-decoding part gives the LLM a quick glance at the future, composed of approximate tokens, which can lead to answers faster than the regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within the LLM that supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT saves nearly 20% of inference time with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size exhibits considerable robustness across different tasks.
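To make the mechanism concrete, below is a minimal, hypothetical sketch of the core idea the abstract describes: a single forward pass that both decodes the next exact token autoregressively and refreshes a window of approximate future tokens via parallel (Jacobi-style) decoding. This is not the authors' implementation; it assumes a HuggingFace-style causal LM interface, uses a fixed-size lookahead window for simplicity (the paper's window size varies with position), and the names fastcot_step, exact_tokens, and approx_tokens are illustrative.

```python
import torch

def fastcot_step(model, exact_tokens, approx_tokens):
    """One decoding step: exact prefix + approximate lookahead in a single pass.

    exact_tokens:  1-D LongTensor, tokens decoded so far (the verified prefix).
    approx_tokens: 1-D LongTensor, current guesses for the next K future tokens.
    """
    # Concatenate the verified prefix with the approximate lookahead window so
    # one forward pass scores every position, filling otherwise idle GPU slots.
    input_ids = torch.cat([exact_tokens, approx_tokens]).unsqueeze(0)
    logits = model(input_ids).logits[0]  # (seq_len, vocab)

    prefix_len = exact_tokens.numel()
    k = approx_tokens.numel()

    # Autoregressive part: the last prefix position predicts the next exact token.
    next_token = logits[prefix_len - 1].argmax(-1, keepdim=True)

    # Parallel part: all K lookahead positions are re-predicted at once; repeated
    # steps let these approximate tokens converge toward the exact continuation.
    # (Position i predicts token i + 1, so the slice starts at prefix_len - 1;
    # its first entry coincides with next_token, which becomes exact.)
    new_approx = logits[prefix_len - 1 : prefix_len - 1 + k].argmax(-1)

    return torch.cat([exact_tokens, next_token]), new_approx
```

In use, such a step would be called in a loop: the exact prefix grows by one verified token per pass while the approximate window is refreshed in parallel, so the model always has a rough preview of the upcoming chain-of-thought and answer extraction can be attempted before the full reasoning chain is decoded exactly.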