OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
June 18, 2024
作者: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
cs.AI
Abstract
The evolution of Artificial Intelligence (AI) has been significantly
accelerated by advancements in Large Language Models (LLMs) and Large
Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning
abilities in problem-solving and scientific discovery (i.e., AI4Science) once
exclusive to human intellect. To comprehensively evaluate current models'
performance in cognitive reasoning abilities, we introduce OlympicArena, which
includes 11,163 bilingual problems across both text-only and interleaved
text-image modalities. These challenges encompass a wide range of disciplines
spanning seven fields and 62 international Olympic competitions, rigorously
examined for data leakage. We argue that the challenges in Olympic competition
problems are ideal for evaluating AI's cognitive reasoning due to their
complexity and interdisciplinary nature, which are essential for tackling
complex scientific challenges and facilitating discoveries. Beyond evaluating
performance across various disciplines using answer-only criteria, we conduct
detailed experiments and analyses from multiple perspectives. We delve into the
models' cognitive reasoning abilities, their performance across different
modalities, and their outcomes in process-level evaluations, which are vital
for tasks requiring complex reasoning with lengthy solutions. Our extensive
evaluations reveal that even advanced models like GPT-4o only achieve a 39.97%
overall accuracy, illustrating current AI limitations in complex reasoning and
multimodal integration. Through the OlympicArena, we aim to advance AI towards
superintelligence, equipping it to address more complex challenges in science
and beyond. We also provide a comprehensive set of resources to support AI
research, including a benchmark dataset, an open-source annotation platform, a
detailed evaluation tool, and a leaderboard with automatic submission features.
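The abstract reports answer-only accuracy aggregated across disciplines (e.g., 39.97% overall for GPT-4o). The snippet below is a minimal sketch, not the paper's official evaluation tool, of how such per-discipline and overall accuracy could be aggregated; the record fields ("discipline", "prediction", "answer") and the exact-match comparison are illustrative assumptions, and the benchmark's actual scorer may apply more careful answer normalization.

```python
# Minimal sketch (assumed schema, not the official OlympicArena evaluator):
# aggregate answer-only accuracy overall and per discipline.
from collections import defaultdict

def answer_only_accuracy(records):
    """records: iterable of dicts with 'discipline', 'prediction', 'answer' keys."""
    per_discipline = defaultdict(lambda: [0, 0])  # discipline -> [correct, total]
    for r in records:
        # Naive exact match after trimming and lowercasing (illustrative only).
        correct = r["prediction"].strip().lower() == r["answer"].strip().lower()
        per_discipline[r["discipline"]][0] += int(correct)
        per_discipline[r["discipline"]][1] += 1
    overall_correct = sum(c for c, _ in per_discipline.values())
    overall_total = sum(t for _, t in per_discipline.values())
    return {
        "overall": overall_correct / max(overall_total, 1),
        "by_discipline": {d: c / t for d, (c, t) in per_discipline.items()},
    }

if __name__ == "__main__":
    # Toy data for illustration; real records would come from the benchmark.
    toy = [
        {"discipline": "Math", "prediction": "42", "answer": "42"},
        {"discipline": "Physics", "prediction": "9.8 m/s^2", "answer": "9.81 m/s^2"},
    ]
    print(answer_only_accuracy(toy))  # overall 0.5; Math 1.0, Physics 0.0
```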