

Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?

March 14, 2025
Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
cs.AI

Abstract

This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and wider ranges of attribute values. To assess the influence of visual uncertainty on these nonverbal analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smooth the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, that achieves state-of-the-art performance on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong accuracy with only a modest reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
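The two-fold perturbation described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the attribute names, the number of confounders, the value range, and the neighborhood-based smoothing scheme are all assumptions made for the sake of the example.

```python
import random

def add_confounders(panel: dict, n_confounders: int = 3, value_range: int = 100) -> dict:
    """Step 1: add randomly sampled attributes that carry no signal
    about the puzzle's correct answer (confounding attributes)."""
    noisy = dict(panel)
    for i in range(n_confounders):
        noisy[f"confounder_{i}"] = random.randrange(value_range)
    return noisy

def smooth_value(value: int, value_range: int = 100,
                 spread: int = 2, peak: float = 0.8) -> list[float]:
    """Step 2: replace a crisp attribute value with a probability
    distribution that keeps most mass on the true value and spreads
    the remainder over neighboring values, emulating perceptual noise."""
    probs = [0.0] * value_range
    probs[value] = peak
    neighbours = [v for v in range(value - spread, value + spread + 1)
                  if 0 <= v < value_range and v != value]
    for v in neighbours:
        probs[v] = (1.0 - peak) / len(neighbours)
    return probs
```

For example, `smooth_value(50)` yields a distribution whose mode is still 50, so a robust reasoner can recover the underlying value, while a brittle one may be thrown off by the non-zero mass on neighbors.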

