CEO-Bench: Can Agents Play the Long Game?

Abstract

Language model agents are becoming proficient at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions through code. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.