CoffeeBench: A Long-Horizon LLM Agent Benchmark for Heterogeneous Multi-Agent Economies

Abstract

As LLM agents become capable of longer-horizon tasks, evaluating their performance in economic systems grows more important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, and most achieve positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.