Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance
May 26, 2023
Authors: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot
cs.AI
Abstract
As large language models (LLMs) are continuously being developed, their
evaluation becomes increasingly important yet challenging. This work proposes
Chain-of-Thought Hub, an open-source evaluation suite for the multi-step
reasoning capabilities of large language models. We are interested in this
setting for two reasons: (1) from the behavior of the GPT and PaLM model
families, we observe that complex reasoning is likely to be a key
differentiator between weaker and stronger LLMs; (2) we envision large
language models becoming the next-generation computational platform and
fostering an ecosystem of new LLM-based applications, which naturally requires
the foundation models to perform complex tasks that often involve the
composition of linguistic and logical operations. Our approach is to compile a
suite of challenging reasoning benchmarks to track the progress of LLMs. Our
current results show that: (1) model scale clearly correlates with reasoning
capability; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models
comparable to GPT-4, while open-source models still lag behind; (3) LLaMA-65B
performs close to code-davinci-002, indicating that with further successful
development, such as reinforcement learning from human feedback (RLHF), it has
great potential to approach GPT-3.5-Turbo. Our results also suggest that for
open-source efforts to catch up, the community may need to focus more on
building better base models and exploring RLHF.
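
To make the evaluation setting concrete, below is a minimal sketch of how a chain-of-thought benchmark harness of this kind typically scores a model: prompt with a few-shot CoT exemplar, let the model reason step by step, then extract and exact-match the final answer. This is not the Chain-of-Thought Hub code itself; the exemplar, `extract_answer`, and the `query_model` callable are illustrative assumptions standing in for any LLM completion API.

```python
# Minimal sketch (assumed, not the actual Chain-of-Thought Hub harness) of
# few-shot chain-of-thought evaluation with exact-match answer scoring.
import re

# One hypothetical CoT exemplar; real suites use several per benchmark.
FEW_SHOT_PROMPT = (
    "Q: A farmer has 15 sheep and buys 8 more. How many sheep now?\n"
    "A: Let's think step by step. 15 + 8 = 23. The answer is 23.\n\n"
)

def extract_answer(completion: str) -> str | None:
    """Take the last number following 'The answer is' as the final answer."""
    matches = re.findall(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None

def evaluate(examples, query_model) -> float:
    """Return exact-match accuracy over (question, gold_answer) pairs.

    `query_model` is a hypothetical stand-in: any callable mapping a
    prompt string to a completion string.
    """
    correct = 0
    for question, gold in examples:
        prompt = FEW_SHOT_PROMPT + f"Q: {question}\nA: Let's think step by step."
        completion = query_model(prompt)
        if extract_answer(completion) == gold:
            correct += 1
    return correct / len(examples)

if __name__ == "__main__":
    # Trivial fake model, just to show the interface.
    fake_model = lambda prompt: "3 + 4 = 7. The answer is 7."
    print(evaluate([("What is 3 + 4?", "7")], fake_model))  # 1.0
```

Under this setup, accuracy on a fixed benchmark suite gives a single comparable number per model, which is what allows the scale-versus-reasoning comparisons reported above.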