EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

February 10, 2026
Authors: Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou
cs.AI

Abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation in EcoGym is based on business-relevant outcomes (e.g., net worth, income, and daily active users (DAU)), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models are significantly suboptimal in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
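
To make the idea of budgeted actions inside a day-loop concrete, here is a minimal toy sketch in Python. It is not EcoGym's interface: the class `ToyVendingEnv`, the function names, the budget of three actions per day, the prices, and the demand model are all assumptions made up for illustration, and a fixed rule stands in for the LLM agent.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyVendingEnv:
    """Hypothetical stand-in for an economic environment (not EcoGym's API)."""
    cash: float = 100.0
    stock: int = 0
    unit_cost: float = 1.0           # assumed purchase cost per item
    unit_price: float = 2.0          # assumed sale price per item
    actions_per_day: int = 3         # assumed per-day action budget
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def observe(self) -> dict:
        # Partial observability: the agent never sees upcoming demand.
        return {"cash": self.cash, "stock": self.stock}

    def act(self, action: str, qty: int = 0) -> None:
        # Only "restock" changes state; any other action is a no-op.
        if action == "restock":
            affordable = int(self.cash // self.unit_cost)
            qty = max(0, min(qty, affordable))
            self.cash -= qty * self.unit_cost
            self.stock += qty

    def end_day(self) -> None:
        # Stochastic demand drawn once per day (uniform on 0..10, assumed).
        demand = self.rng.randint(0, 10)
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * self.unit_price

    def net_worth(self) -> float:
        # Cash plus inventory valued at cost.
        return self.cash + self.stock * self.unit_cost


def restock_policy(obs: dict) -> tuple[str, int]:
    """Placeholder for an LLM agent: top stock back up to 10 units."""
    return ("restock", max(0, 10 - obs["stock"]))


def run_episode(days: int = 365, seed: int = 0) -> tuple[float, int]:
    """Run the day-loop and return (final net worth, total decision steps)."""
    env = ToyVendingEnv(rng=random.Random(seed))
    steps = 0
    for _ in range(days):
        for _ in range(env.actions_per_day):
            action, qty = restock_policy(env.observe())
            env.act(action, qty)
            steps += 1
        env.end_day()
    return env.net_worth(), steps
```

Calling `run_episode()` returns the final net worth together with the step count. Under the assumed budget of three actions per day, 365 day-loops give 1095 decision steps, which is one way a 365-day evaluation can exceed 1000 steps.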