ARE: Scaling Up Agent Environments and Evaluations

Abstract

We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions for building complex and diverse environments, each with its own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguity and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE's abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.