Humans and LLMs Diverge on Probabilistic Inferences
February 26, 2026
Authors: Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy
cs.AI
Abstract
Human reasoning often involves reasoning over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
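To make the comparison concrete, here is a minimal sketch of how one might contrast a human likelihood-rating distribution with repeated samples from a model. The 5-point rating scale, the toy data, and the function names are illustrative assumptions; the abstract does not specify the paper's actual annotation scale or comparison metric.

```python
# Hypothetical sketch: comparing a human rating distribution with model
# samples via Jensen-Shannon distance. The 5-point likelihood scale and
# all data below are illustrative assumptions, not the paper's protocol.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

SCALE = [1, 2, 3, 4, 5]  # assumed Likert-style likelihood ratings


def rating_distribution(ratings, scale=SCALE):
    """Normalize raw ratings into a probability distribution over the scale."""
    counts = Counter(ratings)
    total = len(ratings)
    return np.array([counts.get(r, 0) / total for r in scale])


# Toy data: 25 human annotators give graded, varied judgments,
# while the model collapses onto a single answer.
human_ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4, 4, 3,
                 5, 4, 2, 4, 5, 3, 4, 4, 5, 4, 3, 4]
model_ratings = [5] * 25  # a model that always answers "very likely"

p_human = rating_distribution(human_ratings)
p_model = rating_distribution(model_ratings)

# jensenshannon returns the JS distance (the square root of the divergence);
# 0 means identical distributions, larger values mean greater mismatch.
print(f"JS distance: {jensenshannon(p_human, p_model):.3f}")
```

A distance of 0 would indicate a model matching the human distribution exactly; the abstract's finding is that models consistently fall short of this, producing less graded, less varied responses than human annotators.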