CEO-Bench: Can Agents Play the Long Game?

June 16, 2026
Authors: Haozhe Chen, Karthik Narasimhan, Zhuang Liu
cs.AI

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.