

GAIA: a benchmark for General AI Assistants

November 21, 2023
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
cs.AI

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's ability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
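For readers who want to inspect the released questions programmatically, the sketch below shows one way to pull them from the Hugging Face Hub with the `datasets` library. The repository id `gaia-benchmark/GAIA`, the config name `2023_all`, the split name, and the field names `Question` and `Final answer` are assumptions inferred from the leaderboard URL above, not details stated in the abstract; consult the dataset card for the actual schema and access terms.

```python
# Minimal sketch of loading the public GAIA questions with the Hugging Face
# `datasets` library. Repository id, config, split, and field names are
# assumptions, not details given in the abstract.
from datasets import load_dataset

# The dataset is assumed to be gated: accept its terms on the Hub and
# authenticate locally (e.g. `huggingface-cli login`) before loading.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

for example in gaia["validation"]:
    # "Question" holds the task text; "Final answer" is the ground truth,
    # which the authors withhold for the held-out test questions.
    print(example["Question"], "->", example["Final answer"])
    break
```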