OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

Abstract

In this report, we ask: which AI model is the most intelligent to date, as measured by OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose an Olympic medal table approach that ranks AI models by their comprehensive performance across disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet is highly competitive with GPT-4o in overall performance and even surpasses it on a few subjects (namely Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V rank consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) AI models from the open-source community lag significantly behind these proprietary models. (4) The performance of these models on the benchmark remains less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
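
The medal table idea can be illustrated with a minimal sketch. The abstract does not spell out the exact rules, so the following assumes the conventional Olympic ordering: in each discipline the top three models receive gold, silver, and bronze, and models are then ranked by gold count, with silver and bronze counts as tie-breakers. The function name `medal_table`, the model names, and all scores are hypothetical placeholders, not results from the benchmark.

```python
def medal_table(scores):
    """Build a medal table from per-discipline scores.

    scores: {discipline: {model: score}}, higher score is better.
    Returns a list of (model, gold, silver, bronze) tuples, best first.
    Assumption: ranking is by gold, then silver, then bronze counts;
    the model name is used only to make the final order deterministic.
    """
    medals = {}
    for by_model in scores.values():
        # Make sure every model appears in the table, even with no medals.
        for model in by_model:
            medals.setdefault(model, [0, 0, 0])
        # Sort models in this discipline from best to worst score.
        # Equal scores keep their input order (sorted() is stable).
        ranked = sorted(by_model, key=by_model.get, reverse=True)
        for place, model in enumerate(ranked[:3]):
            medals[model][place] += 1
    rows = [(model, *counts) for model, counts in medals.items()]
    return sorted(rows, key=lambda r: (-r[1], -r[2], -r[3], r[0]))


# Hypothetical scores for illustration only (not benchmark results).
scores = {
    "math": {"model_a": 40.0, "model_b": 35.0, "model_c": 30.0, "model_d": 20.0},
    "physics": {"model_a": 30.0, "model_b": 38.0, "model_c": 25.0, "model_d": 28.0},
    "chemistry": {"model_a": 45.0, "model_b": 50.0, "model_c": 41.0, "model_d": 30.0},
}

table = medal_table(scores)
for model, gold, silver, bronze in table:
    print(f"{model}: gold={gold} silver={silver} bronze={bronze}")
```

With these placeholder scores, `model_b` ranks first with two golds even though `model_a` wins a medal in every discipline, which shows how a medal table rewards first-place finishes rather than average performance.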