GAIA: a benchmark for General AI Assistants

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents answer 92% of the questions correctly, versus 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills, such as law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which targets tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
