Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
February 12, 2026
Authors: Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom
cs.AI
Abstract
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios in which environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs among reasoning, efficiency, and robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.