Humans and LLMs Diverge on Probabilistic Inferences

Abstract

Human reasoning often involves working from limited information to arrive at probabilistic conclusions. In its simplest form, this means making an inference that is not strictly entailed by a premise but is only likely given that premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25 to 30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
