OAgents: An Empirical Study of Building Effective Agents

Abstract

Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring progress remains challenging. In this work, we conduct a systematic empirical study on the GAIA benchmark and BrowseComp to examine, in a fair and rigorous manner, the impact of popular design choices in key agent components. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents and which are redundant despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
