

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

September 18, 2024
作者: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
cs.AI

Abstract

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
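The selective-application idea in the abstract can be sketched as a simple prompt router: use CoT only when a question looks symbolic, approximated here by the paper's equals-sign signal. This is a minimal illustration, not the authors' evaluation harness; the prompt templates and the `choose_prompt` helper are assumptions for the sake of the example.

```python
# Hypothetical sketch of selective CoT routing. The equals-sign check
# mirrors the heuristic described in the abstract; everything else
# (templates, function names) is illustrative.

DIRECT_PROMPT = "Answer with only the final answer.\nQ: {q}\nA:"
COT_PROMPT = "Think step by step, then give the final answer.\nQ: {q}\nA:"


def looks_symbolic(question: str) -> bool:
    """Crude signal from the abstract: an '=' suggests symbolic reasoning."""
    return "=" in question


def choose_prompt(question: str) -> str:
    """Route to CoT only where it is likely to help, saving inference cost."""
    template = COT_PROMPT if looks_symbolic(question) else DIRECT_PROMPT
    return template.format(q=question)
```

In practice the routing signal could also inspect the model's draft response, as the abstract notes, but a question-side check is the cheapest place to start.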
