OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models

May 19, 2023
Authors: Badr AlKhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz, Mona Diab
cs.AI

Abstract

In this paper, we conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as a representative of such models. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS benchmark, covering 26 distinct reasoning skills, using three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations on different reasoning skills. Our findings reveal that including explanations in the few-shot exemplars has no significant impact on a finetuned model's performance, while positively affecting its non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as we incorporate explanations during prompting and finetuning. Finally, we offer insights on which skills benefit most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as skills that exhibit negligible or negative effects.
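To make the scale of the evaluation concrete, the sketch below reconstructs the 27-configuration grid implied by the abstract (3 model sizes × 3 finetuning variants × 3 prompting techniques, each evaluated over 57 tasks). The size labels and prompting-technique names are placeholder assumptions, since the abstract does not name them; only the counts come from the text.

```python
from itertools import product

# Hypothetical reconstruction of the evaluation grid described in the
# abstract. Size labels and prompting-technique names are placeholders;
# the abstract only specifies "three different sizes", three finetuning
# variants, and "three prompting techniques".
MODEL_SIZES = ["size-1", "size-2", "size-3"]        # three OPT sizes (placeholders)
FINETUNING = ["vanilla", "OPT-R", "OPT-RE"]          # none / without / with explanations
PROMPTING = ["technique-1", "technique-2", "technique-3"]  # placeholders

NUM_TASKS = 57  # out-of-domain tasks from SUPER-NATURALINSTRUCTIONS

configurations = list(product(MODEL_SIZES, FINETUNING, PROMPTING))
print(len(configurations))              # 27 configurations
print(len(configurations) * NUM_TASKS)  # 1,539 configuration-task pairs
# The abstract reports 6,156 test evaluations; 6156 / 1539 = 4, which
# suggests each configuration-task pair was evaluated four times (e.g.,
# over repeated runs or exemplar samples), though the abstract does not
# specify the source of this factor.
```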