ARE: Scaling Up Agent Environments and Evaluations
September 21, 2025
Authors: Pierre Andrews, Amine Benhalloum, Gerard Moreno-Torres Bertran, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Romain Froger, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Grégoire Mialon, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Thomas Scialom, Vladislav Vorotilov, Mengjue Wang, Ian Yu
cs.AI
Abstract
We introduce Meta Agents Research Environments (ARE), a research platform for
scalable creation of environments, integration of synthetic or real
applications, and execution of agentic orchestrations. ARE provides simple
abstractions to build complex and diverse environments, each with their own
rules, tools, content, and verifiers, helping to bridge the gap between model
development and real-world deployment. We also propose Gaia2, a benchmark built
in ARE and designed to measure general agent capabilities. Beyond search and
execution, Gaia2 requires agents to handle ambiguities and noise, adapt to
dynamic environments, collaborate with other agents, and operate under temporal
constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new
failure modes that are invisible in static settings. Our experiments show that
no system dominates across the intelligence spectrum: stronger reasoning often
comes at the cost of efficiency, and budget scaling curves plateau,
highlighting the need for new architectures and adaptive compute strategies.
Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2
to other environments, empowering the community to rapidly create new
benchmarks tailored to their domains. In AI's second half, progress
increasingly depends on defining meaningful tasks and robust evaluations to
drive frontier capabilities forward.