The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
February 3, 2025
Authors: Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
cs.AI
Abstract
The releases of OpenAI's o1 and o3 mark a significant paradigm shift in Large
Language Models towards advanced reasoning capabilities. Notably, o3
outperformed humans in novel problem-solving and skill acquisition on the
Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI).
However, this benchmark is limited to symbolic patterns, whereas humans often
perceive and reason about multimodal scenarios involving both vision and
language data. Thus, there is an urgent need to investigate advanced reasoning
capabilities in multimodal tasks. To this end, we track the evolution of the
GPT-[n] and o-[n] series models on challenging multimodal puzzles, requiring
fine-grained visual perception with abstract or algorithmic reasoning. Our
results reveal a clear upward trend in reasoning capabilities across model
iterations, with notable performance jumps across the GPT-series models and
subsequently to o1. Nonetheless, we observe that the o1 model still struggles
with simple multimodal puzzles that require abstract reasoning, and its
performance on algorithmic puzzles remains poor. Moreover, the superior
performance of o1 comes at nearly 750 times the computational cost of GPT-4o,
raising concerns about its efficiency. We plan to continuously track new
models in the series and update our results in this paper accordingly. All
resources used in this
evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
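The tracking protocol the abstract describes — scoring each model generation on a fixed set of multimodal puzzles so that reasoning performance can be compared across iterations — can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the puzzle schema, the model interface, and exact-match scoring are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical puzzle schema: an image reference, a question, and a gold answer.
Puzzle = Dict[str, str]

# A "model" here is any callable taking (image, question) and returning an answer
# string — a stand-in for querying GPT-4o, o1, etc. via an API.
Model = Callable[[str, str], str]


def accuracy(model: Model, puzzles: List[Puzzle]) -> float:
    """Fraction of puzzles the model answers correctly (exact match, case-insensitive)."""
    correct = sum(
        model(p["image"], p["question"]).strip().lower() == p["answer"].lower()
        for p in puzzles
    )
    return correct / len(puzzles)


def track(models: Dict[str, Model], puzzles: List[Puzzle]) -> Dict[str, float]:
    """Score every model iteration on the same puzzle set for side-by-side comparison."""
    return {name: accuracy(fn, puzzles) for name, fn in models.items()}
```

Evaluating each new model release against the identical puzzle set is what makes the reported "upward trend across iterations" a like-for-like comparison.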