OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
June 18, 2024
作者: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
cs.AI
Abstract
The evolution of Artificial Intelligence (AI) has been significantly
accelerated by advancements in Large Language Models (LLMs) and Large
Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning
abilities in problem-solving and scientific discovery (i.e., AI4Science) once
exclusive to human intellect. To comprehensively evaluate current models'
performance in cognitive reasoning abilities, we introduce OlympicArena, which
includes 11,163 bilingual problems across both text-only and interleaved
text-image modalities. These challenges encompass a wide range of disciplines
spanning seven fields and 62 international Olympic competitions, rigorously
examined for data leakage. We argue that the challenges in Olympic competition
problems are ideal for evaluating AI's cognitive reasoning due to their
complexity and interdisciplinary nature, which are essential for tackling
complex scientific challenges and facilitating discoveries. Beyond evaluating
performance across various disciplines using answer-only criteria, we conduct
detailed experiments and analyses from multiple perspectives. We delve into the
models' cognitive reasoning abilities, their performance across different
modalities, and their outcomes in process-level evaluations, which are vital
for tasks requiring complex reasoning with lengthy solutions. Our extensive
evaluations reveal that even advanced models like GPT-4o only achieve a 39.97%
overall accuracy, illustrating current AI limitations in complex reasoning and
multimodal integration. Through the OlympicArena, we aim to advance AI towards
superintelligence, equipping it to address more complex challenges in science
and beyond. We also provide a comprehensive set of resources to support AI
research, including a benchmark dataset, an open-source annotation platform, a
detailed evaluation tool, and a leaderboard with automatic submission features.
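For readers who want a concrete picture of the answer-only evaluation mentioned in the abstract, the sketch below illustrates the general idea: iterate over benchmark problems, query a model, and compare its final answer against the gold answer. The record schema (`problem`, `answer`), the `query_model` helper, and the file name `olympicarena.jsonl` are hypothetical placeholders rather than the official OlympicArena API; the project's released evaluation tool defines the exact matching protocol, including process-level scoring of intermediate reasoning steps, which is omitted here.

```python
import json


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation (hypothetical helper)."""
    raise NotImplementedError


def normalize(answer: str) -> str:
    """Crude normalization; the official evaluation tool applies stricter, rule-based matching."""
    return answer.strip().lower()


def answer_only_accuracy(path: str) -> float:
    """Compute answer-only accuracy over a JSONL dump of benchmark problems.

    Each record is assumed to carry a 'problem' statement and a gold 'answer'.
    Process-level evaluation (scoring each step of a lengthy solution) is out of scope here.
    """
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = query_model(record["problem"])
            correct += normalize(prediction) == normalize(record["answer"])
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Example usage with a hypothetical local dump of the benchmark.
    print(f"answer-only accuracy: {answer_only_accuracy('olympicarena.jsonl'):.2%}")
```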