OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models
May 19, 2023
Authors: Badr AlKhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz, Mona Diab
cs.AI
Abstract
In this paper, we conduct a thorough investigation into the reasoning
capabilities of Large Language Models (LLMs), focusing specifically on the Open
Pretrained Transformers (OPT) models as a representative of such models. Our
study entails finetuning three different sizes of OPT on a carefully curated
reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned
without explanations, and OPT-RE, finetuned with explanations. We then evaluate
all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS
benchmark, covering 26 distinct reasoning skills, utilizing three prompting
techniques. Through a comprehensive grid of 27 configurations and 6,156 test
evaluations, we investigate the dimensions of finetuning, prompting, and scale
to understand the role of explanations on different reasoning skills. Our
findings reveal that including explanations in the few-shot exemplars has no
significant impact on the model's performance when the model is finetuned,
while it positively affects the non-finetuned counterpart. Moreover, we observe
a slight yet consistent increase in classification accuracy as we incorporate
explanations during prompting and during finetuning. Finally, we offer
insights on which skills benefit the most from incorporating explanations
during finetuning and prompting, such as Numerical (+20.4%) and Analogical
(+13.9%) reasoning, as well as skills that exhibit negligible or negative
effects.
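
To make the prompting dimension concrete, the sketch below builds few-shot prompts with and without explanations, mirroring the with/without-explanation contrast studied in the paper. This is a minimal, hypothetical illustration, not the authors' code: the `Exemplar` fields, the prompt template, and the sample question are assumptions.

```python
# Minimal sketch (assumed format, not the paper's implementation): constructing
# few-shot prompts where each exemplar either includes or omits an explanation.
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    question: str
    explanation: str  # rationale for the answer
    answer: str


def build_prompt(exemplars: List[Exemplar], query: str, with_explanations: bool) -> str:
    """Concatenate few-shot exemplars, optionally including their explanations."""
    parts = []
    for ex in exemplars:
        if with_explanations:
            parts.append(
                f"Question: {ex.question}\nExplanation: {ex.explanation}\nAnswer: {ex.answer}"
            )
        else:
            parts.append(f"Question: {ex.question}\nAnswer: {ex.answer}")
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical exemplar and query, for illustration only.
exemplars = [
    Exemplar(
        question="If a train travels 60 km in 1.5 hours, what is its average speed?",
        explanation="Speed is distance divided by time: 60 / 1.5 = 40 km/h.",
        answer="40 km/h",
    ),
]

print(build_prompt(exemplars, "A runner covers 10 km in 50 minutes. What is her pace per km?", with_explanations=True))
```

Keeping the template identical apart from the optional `Explanation:` field is one simple way to isolate the effect of explanations in the exemplars from other prompt-formatting differences.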