ARE: Scaling Up Agent Environments and Evaluations

Abstract

We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions for building complex and diverse environments, each with its own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguity and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE's abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their own domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.
