ARE: Scaling Up Agent Environments and Evaluations

Abstract

We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with its own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.
