Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

Abstract

This work aims to reduce the end-to-end generation latency of large language models (LLMs). One of the major causes of high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. Motivated by the way humans think and write, we propose "Skeleton-of-Thought" (SoT), which guides an LLM to first generate the skeleton of the answer and then issues parallel API calls or performs batched decoding to complete the content of each skeleton point in parallel. SoT not only provides a considerable speed-up (up to 2.39x across 11 different LLMs), but can also potentially improve answer quality on several question categories in terms of diversity and relevance. SoT is an initial attempt at data-centric optimization for efficiency, and it reveals the potential of pushing LLMs to think more like a human in order to improve answer quality.
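
The two-stage flow described in the abstract (one call to produce a skeleton, then concurrent calls to expand each point) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the prompt templates, the `llm` callable, and the helper names `parse_skeleton` and `skeleton_of_thought` are hypothetical, and a thread pool stands in for the parallel API calls the abstract mentions.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical prompt templates; the actual prompts are not given in the abstract.
SKELETON_PROMPT = "Give a short numbered skeleton for answering: {question}"
EXPAND_PROMPT = (
    "Question: {question}\nSkeleton:\n{skeleton}\n"
    "Expand only point {index}: {point}"
)


def parse_skeleton(text: str) -> List[str]:
    """Extract point texts from lines shaped like '1. point text'."""
    pattern = re.compile(r"^\s*\d+\.\s*(.+)$", re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(text)]


def skeleton_of_thought(
    question: str,
    llm: Callable[[str], str],  # assumed: any function mapping a prompt to a completion
    max_workers: int = 8,
) -> str:
    # Stage 1: a single sequential call produces the skeleton.
    skeleton = llm(SKELETON_PROMPT.format(question=question))
    points = parse_skeleton(skeleton)
    if not points:
        # No numbered points found: fall back to the raw response.
        return skeleton

    # Stage 2: expand every point concurrently (stand-in for parallel API calls).
    prompts = [
        EXPAND_PROMPT.format(question=question, skeleton=skeleton, index=i, point=p)
        for i, p in enumerate(points, start=1)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(llm, prompts))  # map preserves input order

    # Stitch the expanded points back together in skeleton order.
    return "\n".join(
        f"{i}. {p}: {e}"
        for i, (p, e) in enumerate(zip(points, expansions), start=1)
    )
```

Because the expansion calls do not depend on one another, their latencies overlap instead of adding up, which is where a speed-up of the kind reported in the abstract would come from. With a locally hosted model, the thread pool would be replaced by batched decoding of the expansion prompts.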
