Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Abstract

Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce EnterpriseArena, the first benchmark for evaluating agents on long-horizon enterprise resource allocation. It instantiates CFO-style decision-making in a 132-month enterprise simulator combining firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules. The environment is partially observable and reveals the state only through budgeted organizational tools, forcing agents to trade off information acquisition against conserving scarce resources. Experiments on eleven advanced LLMs show that this setting remains highly challenging: only 16% of runs survive the full horizon, and larger models do not reliably outperform smaller ones. These results identify long-horizon resource allocation under uncertainty as a distinct capability gap for current LLM agents.
