

GAIA: a benchmark for General AI Assistants

November 21, 2023
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
cs.AI

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which targets tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.