EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
February 10, 2026
Authors: Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou
cs.AI
Abstract
Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation in EcoGym is based on business-relevant outcomes (e.g., net worth, income, and daily active users (DAU)), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
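To make the idea of budgeted actions over repeated day-loops concrete, the following is a minimal illustrative sketch, not EcoGym's actual interface. Every name here (`ToyVendingEnv`, `actions_per_day`, `run_episode`, `restock_when_low`) and every number (prices, demand range, a budget of 3 actions per day) is a hypothetical assumption chosen for illustration. It only shows how a fixed per-day action budget repeated over 365 day-loops yields more than 1000 decision steps, with a hidden stochastic demand process and a net-worth outcome.

```python
import random


class ToyVendingEnv:
    """Hypothetical toy environment; NOT EcoGym's actual API."""

    def __init__(self, actions_per_day=3, seed=0):
        self.actions_per_day = actions_per_day  # assumed per-day action budget
        self.rng = random.Random(seed)          # seeded for reproducibility
        self.cash = 100.0
        self.stock = 0
        self.day = 0
        self.steps = 0

    def observe(self):
        # Partial observability: the agent sees cash and stock,
        # but never the underlying demand process.
        return {"day": self.day, "cash": self.cash, "stock": self.stock}

    def step(self, action):
        # Every call consumes one unit of the day's action budget.
        self.steps += 1
        if action == "restock" and self.cash >= 10.0:
            self.cash -= 10.0   # buy 10 units at an assumed cost of 1.0 each
            self.stock += 10

    def end_day(self):
        # Stochastic demand settles at the end of each day.
        demand = self.rng.randint(0, 8)
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += 2.0 * sold  # assumed sale price of 2.0 per unit
        self.day += 1

    def net_worth(self):
        # Remaining stock is valued at cost.
        return self.cash + 1.0 * self.stock


def run_episode(env, policy, days=365):
    """Run `days` day-loops, spending the full action budget each day."""
    for _ in range(days):
        for _ in range(env.actions_per_day):
            env.step(policy(env.observe()))
        env.end_day()
    return env.net_worth()


def restock_when_low(obs):
    # A trivial baseline policy standing in for an LLM agent.
    return "restock" if obs["stock"] < 5 else "wait"


env = ToyVendingEnv()
final_net_worth = run_episode(env, restock_when_low)
print(env.steps, env.day)
```

Under these assumptions, 3 actions per day over 365 days gives 1095 decision steps. The per-day budget used in the actual benchmark is not stated in the abstract, so this figure is only illustrative of how a "1000+ steps" horizon can arise.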