Premise Order Matters in Reasoning with Large Language Models

Abstract

Large language models (LLMs) have achieved remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, even though such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the prompt in the same order as the ground truth proof (as opposed to a random order) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
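To make the premise-ordering manipulation concrete, the following is a minimal sketch of how one might build two prompts for the same deductive problem, one with premises in proof order and one with the same premises shuffled. The toy rules, the prompt template, and the helper names `build_prompt` and `permute_premises` are illustrative assumptions, not the authors' benchmark code.

```python
import random


def build_prompt(premises, question):
    """Number the premises and append the question (hypothetical template)."""
    lines = [f"{i + 1}. {p}" for i, p in enumerate(premises)]
    return "Premises:\n" + "\n".join(lines) + f"\nQuestion: {question}"


def permute_premises(premises, seed):
    """Return a shuffled copy; the set of premises, and so the task, is unchanged."""
    shuffled = list(premises)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy problem (an assumption for illustration), listed in the order
# in which a proof would apply each rule.
forward_premises = [
    "A is true.",
    "If A then B.",
    "If B then C.",
    "If C then D.",
]
question = "Is D true?"

forward_prompt = build_prompt(forward_premises, question)
shuffled_premises = permute_premises(forward_premises, seed=0)
shuffled_prompt = build_prompt(shuffled_premises, question)
```

Both prompts describe the same logical problem, so any accuracy gap between them when they are sent to a model would be attributable to premise order alone.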
