

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

January 9, 2026
Authors: Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison
cs.AI

Abstract

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).
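To make the multimodal observation space concrete, the sketch below shows what a Gymnasium-style interaction loop over such an environment could look like. The `TowerMindEnv` class, its observation keys (`pixels`, `text`, `state`), and the toy reward/termination logic are all illustrative assumptions for this page, not the project's actual API.

```python
# Hypothetical sketch (NOT the real TowerMind API): a minimal mock
# tower-defense environment that exposes the three observation
# modalities the abstract describes -- pixel-based, textual, and
# structured game-state representations.

class TowerMindEnv:
    """Minimal mock environment with a multimodal observation space."""

    def __init__(self, level: int = 1):
        self.level = level
        self.gold = 100
        self.wave = 0

    def _observe(self) -> dict:
        # Each modality could feed a different agent type:
        # "pixels" for vision models, "text" for LLMs, "state" for RL.
        return {
            "pixels": [[0] * 8 for _ in range(8)],       # stand-in frame
            "text": f"Wave {self.wave}: gold={self.gold}",
            "state": {"gold": self.gold, "wave": self.wave},
        }

    def reset(self) -> dict:
        self.gold, self.wave = 100, 0
        return self._observe()

    def step(self, action: str):
        # Toy dynamics: building a tower costs gold; waves always advance.
        if action == "build_tower" and self.gold >= 50:
            self.gold -= 50
        self.wave += 1
        reward = 1.0                      # placeholder survival reward
        done = self.wave >= 5             # toy episode length of 5 waves
        return self._observe(), reward, done


# Example interaction loop an agent wrapper might follow: an LLM agent
# would read obs["text"], an RL agent obs["state"] or obs["pixels"].
env = TowerMindEnv(level=1)
obs = env.reset()
total = 0.0
while True:
    action = "build_tower" if obs["state"]["gold"] >= 50 else "wait"
    obs, reward, done = env.step(action)
    total += reward
    if done:
        break
print(total)  # 5.0 after five waves
```

The dict-of-modalities design mirrors how the paper's pixel, textual, and structured observations could be served from a single environment step, letting LLM agents and classic RL baselines (e.g. Ape-X DQN, PPO) share one interface.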
PDF · January 13, 2026