GAIA: a benchmark for General AI Assistants

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents achieve 92% accuracy, versus 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
