Humans and LLMs Diverge on Probabilistic Inferences

Abstract

Human reasoning often involves working from limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25 to 30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
