Premise Order Matters in Reasoning with Large Language Models
February 14, 2024
Authors: Xinyun Chen, Ryan A. Chi, Xuezhi Wang, Denny Zhou
cs.AI
Abstract
Large language models (LLMs) have achieved remarkable reasoning
performance in various domains. However, in the domain of reasoning tasks, we
discover a frailty: LLMs are surprisingly brittle to the ordering of the
premises, despite the fact that such ordering does not alter the underlying
task. In particular, we observe that LLMs achieve the best performance when the
premise order aligns with the context required in intermediate reasoning steps.
For example, in deductive reasoning tasks, presenting the premises in the same
order as the ground truth proof in the prompt (as opposed to random ordering)
drastically increases the model's accuracy. We first examine the effect of
premise ordering on deductive reasoning on a variety of LLMs, and our
evaluation shows that permuting the premise order can cause a performance drop
of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to
examine the ordering effect for mathematical problem-solving, and we again
observe a significant drop in accuracy, relative to the original GSM8K
benchmark.
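To make the ordering effect concrete, the following is a minimal sketch of the kind of premise-permutation test the abstract describes, not the paper's released evaluation harness. The premises, question, and `query_llm` function are illustrative assumptions; `query_llm` stands in for whatever LLM API one would actually call.

```python
import random

# Minimal illustrative sketch of the premise-ordering experiment described
# above; this is NOT the paper's evaluation code. `query_llm` is a
# hypothetical stand-in for a real LLM API call.

# Premises listed in ground-truth proof order: each rule's antecedent is
# established by the time the rule is needed.
PREMISES = [
    "Alice goes to the park.",
    "If Alice goes to the park, then Bob goes to the park.",
    "If Bob goes to the park, then Carol goes to the park.",
]
QUESTION = "Does Carol go to the park? Answer yes or no."


def build_prompt(premises):
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(premises))
    return f"Premises:\n{numbered}\n\nQuestion: {QUESTION}"


forward_prompt = build_prompt(PREMISES)

# Permuting the premises leaves the underlying task unchanged, yet the paper
# reports that such permutations can cause accuracy drops of over 30%.
shuffled = list(PREMISES)
random.shuffle(shuffled)
permuted_prompt = build_prompt(shuffled)

# for prompt in (forward_prompt, permuted_prompt):
#     print(query_llm(prompt))  # hypothetical API call
```

Comparing model accuracy over many such forward-ordered versus shuffled prompts is the basic shape of the deductive-reasoning evaluation summarized above.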