CEO-Bench: Can Agents Play the Long Game?
June 16, 2026
Authors: Haozhe Chen, Karthik Narasimhan, Zhuang Liu
cs.AI
Abstract
Language model agents are becoming proficient executors of isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions through code. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.