Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
November 14, 2023
Authors: Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
cs.AI
Abstract
In this work, we propose FastCoT, a model-agnostic framework based on
parallel decoding that requires no further training of an auxiliary model
and no modification to the LLM itself. FastCoT uses a size-varying context
window, whose size changes with position, to conduct parallel decoding and
autoregressive decoding simultaneously, thus fully utilizing GPU
computation resources. In FastCoT, the parallel decoding part provides the
LLM with a quick glance of the future composed of approximate tokens, which
can lead to faster answers than the regular autoregressive decoding used by
causal transformers. We also provide an implementation of parallel decoding
within the LLM that supports KV-cache generation and batch processing.
Through extensive experiments, we demonstrate that FastCoT saves inference
time by nearly 20% with only a negligible performance drop compared to the
regular approach. Additionally, we show that the context window size is
considerably robust across different tasks.
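
The mechanism the abstract describes, one forward pass per step that both
verifies the next token and refreshes a window of approximate future
tokens, can be illustrated with a short Jacobi-style decoding loop. This is
a minimal sketch, not the paper's implementation: FastCoT's size-varying
window, KV-cache reuse, and batched parallel decoding are omitted, and the
glance_decode helper, the fixed window size, and the greedy updates are
illustrative assumptions.

    # Minimal sketch (not the authors' code): each forward pass scores the
    # verified prefix and a window of approximate future tokens together,
    # yielding one new exact token plus a refreshed "glance of the future".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def glance_decode(prompt: str, window: int = 4, steps: int = 20) -> str:
        # "exact" holds verified tokens; "approx" is the current guess at the future.
        exact = tok(prompt, return_tensors="pt").input_ids[0]
        approx = torch.full((window,), tok.eos_token_id, dtype=torch.long)
        for _ in range(steps):
            k = exact.size(0)
            # One pass over [exact prefix + approximate window].
            logits = model(torch.cat([exact, approx]).unsqueeze(0)).logits[0]
            preds = logits.argmax(-1)                     # greedy prediction per position
            exact = torch.cat([exact, preds[k - 1 : k]])  # accept the next exact token
            approx = preds[k : k + window]                # refresh the approximate future
        return tok.decode(exact)

    print(glance_decode("Q: What is 12 * 7? A: Let's think step by step."))

Because the approximate tokens are scored in the same forward pass as the
verified one, the glance of the future costs extra parallel computation but
no extra sequential decoding steps, which is how the approach trades spare
GPU capacity for shorter wall-clock inference time.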