

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

June 18, 2024
Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
cs.AI

Abstract

The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
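To make the answer-only evaluation criterion mentioned above concrete, here is a minimal sketch of how overall accuracy could be computed over a set of graded problems. The record fields (`prediction`, `gold_answer`) and the normalization rule are hypothetical placeholders for illustration, not the benchmark's actual schema or the official OlympicArena evaluation tool.

```python
# Minimal sketch of answer-only accuracy scoring.
# Assumes a hypothetical record schema with "prediction" and "gold_answer" fields;
# this is not the official OlympicArena evaluation tool.

def answer_only_accuracy(records):
    """Fraction of records whose model prediction matches the gold answer
    after simple whitespace/case normalization."""
    if not records:
        return 0.0
    correct = sum(
        1
        for r in records
        if r["prediction"].strip().lower() == r["gold_answer"].strip().lower()
    )
    return correct / len(records)


if __name__ == "__main__":
    # Toy example: two of three answers match, so accuracy is ~66.7%.
    sample = [
        {"prediction": "42", "gold_answer": "42"},
        {"prediction": "H2O", "gold_answer": "h2o"},
        {"prediction": "3.14", "gold_answer": "2.72"},
    ]
    print(f"Overall accuracy: {answer_only_accuracy(sample):.2%}")
```

A real evaluation would also need per-discipline aggregation and the process-level (step-by-step) scoring the abstract describes, which this sketch does not attempt.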
