CEO-Bench: Can Agents Execute Long-Horizon Strategy?

Abstract

Language model agents are becoming proficient at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; and (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, a benchmark that evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions through code. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench is a first step toward measuring the intelligence required to drive sustained, adaptive progress over long horizons.