GAIA: a benchmark for General AI Assistants
November 21, 2023
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
cs.AI
Abstract
We introduce GAIA, a benchmark for General AI Assistants that, if solved,
would represent a milestone in AI research. GAIA proposes real-world questions
that require a set of fundamental abilities such as reasoning, multi-modality
handling, web browsing, and generally tool-use proficiency. GAIA questions are
conceptually simple for humans yet challenging for most advanced AIs: we show
that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.
This notable performance disparity contrasts with the recent trend of LLMs
outperforming humans on tasks that require professional skills in, e.g., law
or chemistry. GAIA's philosophy departs from the current trend in AI
benchmarks, which suggests targeting tasks that are ever more difficult for
humans. We posit that the advent of Artificial General Intelligence (AGI)
hinges on a system's capability to exhibit robustness similar to that of the
average human on such questions. Using GAIA's methodology, we devise 466
questions and their answers. We release our questions while retaining the
answers to 300 of them to power a leaderboard available at
https://huggingface.co/gaia-benchmark.
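
For reference, the released questions can be loaded with the Hugging Face
datasets library. The following is a minimal sketch, not an official recipe:
the repository path "gaia-benchmark/GAIA", the configuration name "2023_all",
and the "Question" field name are assumptions inferred from the leaderboard
URL above, and the dataset is gated, so a Hugging Face login or access token
may be required.

    from datasets import load_dataset  # pip install datasets

    # Load the public GAIA questions from the Hugging Face Hub.
    # Repository path and config name are assumptions; the dataset is
    # gated, so authentication may be required first.
    gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

    validation = gaia["validation"]  # answers included, for development
    test = gaia["test"]              # answers withheld for the leaderboard

    print(validation[0]["Question"])  # field name assumed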