Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

Abstract

This work aims to decrease the end-to-end generation latency of large language models (LLMs). One of the major causes of high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. Motivated by the thinking and writing process of humans, we propose "Skeleton-of-Thought" (SoT), which guides LLMs to first generate the skeleton of the answer and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-up (up to 2.39x across 11 different LLMs), but it can also potentially improve answer quality on several question categories in terms of diversity and relevance. SoT is an initial attempt at data-centric optimization for efficiency, and reveals the potential of pushing LLMs to think more like a human for answer quality.
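The two-stage procedure described in the abstract (skeleton first, then parallel completion of each point) can be illustrated with a minimal sketch. This is not the authors' implementation: the `llm` argument is a hypothetical callable that maps a prompt string to a completion string, the prompt wording is invented for illustration, and the helper names `parse_skeleton` and `skeleton_of_thought` are assumptions. It covers only the parallel API call variant, using a thread pool, and does not show batched decoding.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parse_skeleton(text: str) -> List[str]:
    """Extract points from a numbered list such as '1. Point' or '2) Point'."""
    points = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            points.append(match.group(1).strip())
    return points


def skeleton_of_thought(
    question: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    max_workers: int = 8,
) -> str:
    # Stage 1: one sequential call that asks only for the skeleton.
    # The prompt text is illustrative, not the paper's prompt.
    skeleton_prompt = (
        f"Question: {question}\n"
        "Write only the skeleton of the answer as a numbered list "
        "of 3 to 10 short points, one per line."
    )
    points = parse_skeleton(llm(skeleton_prompt))
    if not points:
        # No usable skeleton: fall back to a single ordinary call.
        return llm(question)

    def expand(point: str) -> str:
        # Stage 2 prompt: expand exactly one skeleton point.
        return llm(
            f"Question: {question}\n"
            f"Skeleton point: {point}\n"
            "Write 1-2 sentences expanding only this point."
        )

    # Stage 2: issue the point-expansion calls concurrently.
    # pool.map returns results in the same order as `points`.
    workers = min(max_workers, len(points))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        expansions = list(pool.map(expand, points))

    # Concatenate the expanded points into the final answer.
    return "\n".join(
        f"{i}. {point}: {body}"
        for i, (point, body) in enumerate(zip(points, expansions), start=1)
    )
```

Under these assumptions, any latency gain comes from the expansion calls overlapping in time, so it depends on the serving endpoint handling concurrent requests. Threads are used here because each call is assumed to be I/O-bound.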
