Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance
May 26, 2023
Authors: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot
cs.AI
Abstract
As large language models (LLMs) are continuously being developed, their
evaluation becomes increasingly important yet challenging. This work proposes
Chain-of-Thought Hub, an open-source evaluation suite for the multi-step
reasoning capabilities of large language models. We are interested in this
setting for two reasons: (1) from the behavior of the GPT and PaLM model
families, we observe that complex reasoning is likely to be a key
differentiator between weaker and stronger LLMs; (2) we envision large
language models becoming the next-generation computational platform and
fostering an ecosystem of new LLM-based applications, which naturally requires
the foundation models to perform complex tasks that often involve the
composition of linguistic and logical operations. Our approach is to compile a
suite of challenging reasoning benchmarks to track the progress of LLMs. Our
current results show that: (1) model scale clearly correlates with reasoning
capability; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models
comparable to GPT-4, while open-source models still lag behind; (3) LLaMA-65B
performs close to code-davinci-002, indicating that with further successful
development, such as reinforcement learning from human feedback (RLHF), it has
great potential to approach GPT-3.5-Turbo. Our results also suggest that for
open-source efforts to catch up, the community may need to focus more on
building better base models and exploring RLHF.
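
To make the evaluation setting concrete, below is a minimal sketch of how a chain-of-thought benchmark harness of this kind typically scores a model: prompt with a few-shot CoT exemplar, let the model reason step by step, then extract and exact-match the final answer. This is not the Chain-of-Thought Hub code itself; the exemplar, `extract_answer`, and the `query_model` callable are illustrative assumptions standing in for any LLM completion API.

```python
# Minimal sketch (assumed, not the actual Chain-of-Thought Hub harness) of
# few-shot chain-of-thought evaluation with exact-match answer scoring.
import re

# One hypothetical CoT exemplar; real suites use several per benchmark.
FEW_SHOT_PROMPT = (
    "Q: A farmer has 15 sheep and buys 8 more. How many sheep now?\n"
    "A: Let's think step by step. 15 + 8 = 23. The answer is 23.\n\n"
)

def extract_answer(completion: str) -> str | None:
    """Take the last number following 'The answer is' as the final answer."""
    matches = re.findall(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None

def evaluate(examples, query_model) -> float:
    """Return exact-match accuracy over (question, gold_answer) pairs.

    `query_model` is a hypothetical stand-in: any callable mapping a
    prompt string to a completion string.
    """
    correct = 0
    for question, gold in examples:
        prompt = FEW_SHOT_PROMPT + f"Q: {question}\nA: Let's think step by step."
        completion = query_model(prompt)
        if extract_answer(completion) == gold:
            correct += 1
    return correct / len(examples)

if __name__ == "__main__":
    # Trivial fake model, just to show the interface.
    fake_model = lambda prompt: "3 + 4 = 7. The answer is 7."
    print(evaluate([("What is 3 + 4?", "7")], fake_model))  # 1.0
```

Under this setup, accuracy on a fixed benchmark suite gives a single comparable number per model, which is what allows the scale-versus-reasoning comparisons reported above.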