

Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?

March 14, 2025
Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
cs.AI

Abstract

This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and wider ranges of attribute values. To assess the influence of visual uncertainty on these nonverbal analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smooth the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, which achieves state-of-the-art performance on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong accuracy with only a modest reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
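The two perturbations described above can be sketched in a few lines of Python. This is an illustrative reconstruction only, not the paper's released code: the function names, attribute names, and value ranges are hypothetical, and the smoothing scheme (most probability mass on the true value, the rest spread uniformly over the remaining values) is one simple way to realize "smoothed input attribute distributions".

```python
import random

def add_confounders(panel_attrs, n_confounders=2, value_range=(0, 9)):
    """Append attributes sampled uniformly at random; by construction
    they carry no information about the puzzle's correct answer."""
    confounders = {
        f"confounder_{i}": random.randint(*value_range)
        for i in range(n_confounders)
    }
    return {**panel_attrs, **confounders}

def smooth_attribute(value, value_range=(0, 9), mass_on_true=0.7):
    """Replace a crisp attribute value with a probability distribution:
    most mass stays on the true value, the remainder is spread uniformly
    over the other values in the range."""
    lo, hi = value_range
    support = list(range(lo, hi + 1))
    rest = (1.0 - mass_on_true) / (len(support) - 1)
    return {v: (mass_on_true if v == value else rest) for v in support}

# Example: a panel described by symbolic attributes (names are illustrative).
panel = {"shape": 3, "size": 5}
noisy_panel = add_confounders(panel)      # adds uninformative attributes
size_dist = smooth_attribute(panel["size"])  # crisp value -> distribution
```

Under this sketch, a downstream reasoner no longer receives clean symbolic inputs: it must ignore the confounders and reason over distributions rather than point values, which is the regime in which the LRMs' accuracy collapses while ARLC remains robust.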

