Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Abstract

Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce EnterpriseArena, the first benchmark for evaluating agents on long-horizon enterprise resource allocation. It instantiates CFO-style decision-making in a 132-month enterprise simulator that combines firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules. The environment is partially observable and reveals its state only through budgeted organizational tools, forcing agents to trade off information acquisition against conserving scarce resources. Experiments on eleven advanced LLMs show that this setting remains highly challenging: only 16% of runs survive the full horizon, and larger models do not reliably outperform smaller ones. These results identify long-horizon resource allocation under uncertainty as a distinct capability gap for current LLM agents.
