Humans and LLMs Diverge on Probabilistic Inferences
February 26, 2026
Authors: Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy
cs.AI
Abstract
Human reasoning often involves reasoning over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
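To make the comparison concrete, here is a minimal sketch of how one might contrast a human likelihood-rating distribution with repeated samples from a model. The 5-point rating scale, the toy data, and the function names are illustrative assumptions; the abstract does not specify the paper's actual annotation scale or comparison metric.

```python
# Hypothetical sketch: comparing a human rating distribution with model
# samples via Jensen-Shannon distance. The 5-point likelihood scale and
# all data below are illustrative assumptions, not the paper's protocol.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

SCALE = [1, 2, 3, 4, 5]  # assumed Likert-style likelihood ratings


def rating_distribution(ratings, scale=SCALE):
    """Normalize raw ratings into a probability distribution over the scale."""
    counts = Counter(ratings)
    total = len(ratings)
    return np.array([counts.get(r, 0) / total for r in scale])


# Toy data: 25 human annotators give graded, varied judgments,
# while the model collapses onto a single answer.
human_ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4, 4, 3,
                 5, 4, 2, 4, 5, 3, 4, 4, 5, 4, 3, 4]
model_ratings = [5] * 25  # a model that always answers "very likely"

p_human = rating_distribution(human_ratings)
p_model = rating_distribution(model_ratings)

# jensenshannon returns the JS distance (the square root of the divergence);
# 0 means identical distributions, larger values mean greater mismatch.
print(f"JS distance: {jensenshannon(p_human, p_model):.3f}")
```

A distance of 0 would indicate a model matching the human distribution exactly; the abstract's finding is that models consistently fall short of this, producing less graded, less varied responses than human annotators.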