Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster

Abstract

In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding that requires neither further training of an auxiliary model nor modification of the LLM itself. FastCoT uses a context window whose size varies with position to conduct parallel decoding and autoregressive decoding simultaneously, thus making full use of GPU computation resources. In FastCoT, the parallel decoding part gives the LLM a quick glance of the future, composed of approximate tokens, which can lead to faster answers than the regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within the LLM that supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT reduces inference time by nearly 20% with only a negligible performance drop compared to the regular approach. Additionally, we show that the choice of context window size is considerably robust across different tasks.
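To make the idea of combining parallel and autoregressive decoding concrete, the following is a minimal sketch of generic Jacobi-style parallel decoding, not the FastCoT implementation. It rests on several assumptions that are not in the abstract: `toy_next_token` is a hypothetical deterministic stand-in for a greedy LLM step, the window has a fixed size rather than a position-dependent one, and the list comprehension stands in for what would be a single batched forward pass on a GPU. The first position in the window depends only on confirmed tokens, so it plays the role of the exact autoregressive step, while the remaining positions hold approximate "future" tokens that are refined on each pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for one greedy LLM step (deterministic in the prefix)."""
    return (len(prefix) + prefix[-1] // 3) % 6


def autoregressive_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one exact token per model call."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def jacobi_decode(next_token: NextToken, prompt: List[int], n_new: int,
                  window: int = 4, pad: int = 0) -> Tuple[List[int], int]:
    """Jacobi-style parallel decoding with a fixed-size window (an assumption)."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    guesses = [pad] * window  # approximate future tokens
    passes = 0
    while len(seq) < target_len:
        # One simulated forward pass: position i is predicted from the
        # confirmed tokens plus the previous guesses for positions < i.
        new = [next_token(seq + guesses[:i]) for i in range(window)]
        passes += 1
        # new[0] is conditioned only on confirmed tokens, so it is exact.
        # new[i] is exact as well if every guess it was conditioned on
        # turned out to equal the corresponding exact token.
        accepted = 1
        while accepted < window and guesses[accepted - 1] == new[accepted - 1]:
            accepted += 1
        seq.extend(new[:accepted])
        # Keep the unconfirmed predictions as the next round's guesses.
        guesses = new[accepted:] + [pad] * accepted
    return seq[:target_len], passes


prompt = [1, 2]
reference = autoregressive_decode(toy_next_token, prompt, 12)
decoded, passes = jacobi_decode(toy_next_token, prompt, 12, window=4)
print(decoded == reference, passes)
```

Because a token is accepted only when everything it was conditioned on has been confirmed, the output matches greedy autoregressive decoding exactly, and each pass confirms at least one token. How many passes are actually saved depends on how quickly the approximate tokens stabilise, which this toy function does not model realistically.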

