

GAIA: a benchmark for General AI Assistants

November 21, 2023
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
cs.AI

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills, e.g. in law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system exhibiting robustness on such questions similar to that of the average human. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
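The released questions can be pulled directly from the Hugging Face Hub. Below is a minimal sketch, assuming the dataset is published under the id "gaia-benchmark/GAIA" with a "2023_all" configuration and the field names "Question", "Level", and "Final answer" (none of which are stated in the abstract); the dataset is gated, so accepting its terms and running `huggingface-cli login` may be required first.

```python
# Minimal sketch: loading the public GAIA questions from the Hugging Face Hub.
# Assumptions (not stated in the abstract): the dataset id "gaia-benchmark/GAIA",
# the "2023_all" configuration, and the field names used below.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

# The validation split ships reference answers; the test split withholds them
# to power the leaderboard.
example = gaia["validation"][0]
print(example["Question"])       # the natural-language task
print(example["Level"])          # difficulty level
print(example["Final answer"])   # reference answer (validation split only)
```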