OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
June 18, 2024
作者: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
cs.AI
Abstract
The evolution of Artificial Intelligence (AI) has been significantly
accelerated by advancements in Large Language Models (LLMs) and Large
Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning
abilities in problem-solving and scientific discovery (i.e., AI4Science) once
exclusive to human intellect. To comprehensively evaluate current models'
performance in cognitive reasoning abilities, we introduce OlympicArena, which
includes 11,163 bilingual problems across both text-only and interleaved
text-image modalities. These challenges encompass a wide range of disciplines
spanning seven fields and 62 international Olympic competitions, rigorously
examined for data leakage. We argue that the challenges in Olympic competition
problems are ideal for evaluating AI's cognitive reasoning due to their
complexity and interdisciplinary nature, which are essential for tackling
complex scientific challenges and facilitating discoveries. Beyond evaluating
performance across various disciplines using answer-only criteria, we conduct
detailed experiments and analyses from multiple perspectives. We delve into the
models' cognitive reasoning abilities, their performance across different
modalities, and their outcomes in process-level evaluations, which are vital
for tasks requiring complex reasoning with lengthy solutions. Our extensive
evaluations reveal that even advanced models like GPT-4o only achieve a 39.97%
overall accuracy, illustrating current AI limitations in complex reasoning and
multimodal integration. Through the OlympicArena, we aim to advance AI towards
superintelligence, equipping it to address more complex challenges in science
and beyond. We also provide a comprehensive set of resources to support AI
research, including a benchmark dataset, an open-source annotation platform, a
detailed evaluation tool, and a leaderboard with automatic submission features.
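The abstract reports answer-only accuracy aggregated across disciplines (e.g., 39.97% overall for GPT-4o). The snippet below is a minimal sketch, not the paper's official evaluation tool, of how such per-discipline and overall accuracy could be aggregated; the record fields ("discipline", "prediction", "answer") and the exact-match comparison are illustrative assumptions, and the benchmark's actual scorer may apply more careful answer normalization.

```python
# Minimal sketch (assumed schema, not the official OlympicArena evaluator):
# aggregate answer-only accuracy overall and per discipline.
from collections import defaultdict

def answer_only_accuracy(records):
    """records: iterable of dicts with 'discipline', 'prediction', 'answer' keys."""
    per_discipline = defaultdict(lambda: [0, 0])  # discipline -> [correct, total]
    for r in records:
        # Naive exact match after trimming and lowercasing (illustrative only).
        correct = r["prediction"].strip().lower() == r["answer"].strip().lower()
        per_discipline[r["discipline"]][0] += int(correct)
        per_discipline[r["discipline"]][1] += 1
    overall_correct = sum(c for c, _ in per_discipline.values())
    overall_total = sum(t for _, t in per_discipline.values())
    return {
        "overall": overall_correct / max(overall_total, 1),
        "by_discipline": {d: c / t for d, (c, t) in per_discipline.items()},
    }

if __name__ == "__main__":
    # Toy data for illustration; real records would come from the benchmark.
    toy = [
        {"discipline": "Math", "prediction": "42", "answer": "42"},
        {"discipline": "Physics", "prediction": "9.8 m/s^2", "answer": "9.81 m/s^2"},
    ]
    print(answer_only_accuracy(toy))  # overall 0.5; Math 1.0, Physics 0.0
```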