Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
February 12, 2026
Authors: Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom
cs.AI
Abstract
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs among reasoning, efficiency, and robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.