CoffeeBench: Benchmarking Long-Horizon LLM Agents in a Heterogeneous Multi-Agent Economy

Abstract

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems grows more important. Existing benchmarks primarily evaluate a single agent interacting with a passive environment, but economic systems are inherently multi-agent: autonomous agents must communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation. Each firm seeks to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, and most achieve positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.