

GAIA: a benchmark for General AI Assistants

November 21, 2023
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
cs.AI

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's ability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
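For readers who want to inspect the released questions programmatically, the sketch below shows one way to pull them from the Hugging Face Hub with the `datasets` library. The repository id `gaia-benchmark/GAIA`, the config name `2023_all`, the split name, and the field names `Question` and `Final answer` are assumptions inferred from the leaderboard URL above, not details stated in the abstract; consult the dataset card for the actual schema and access terms.

```python
# Minimal sketch of loading the public GAIA questions with the Hugging Face
# `datasets` library. Repository id, config, split, and field names are
# assumptions, not details given in the abstract.
from datasets import load_dataset

# The dataset is assumed to be gated: accept its terms on the Hub and
# authenticate locally (e.g. `huggingface-cli login`) before loading.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

for example in gaia["validation"]:
    # "Question" holds the task text; "Final answer" is the ground truth,
    # which the authors withhold for the held-out test questions.
    print(example["Question"], "->", example["Final answer"])
    break
```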