World Models for Policy Refinement in StarCraft II
February 16, 2026
Authors: Yixin Zhang, Ziyi Wang, Yiming Rong, Haoxi Wang, Jinling Jiang, Shuang Xu, Haoran Wu, Shiyu Zhou, Bo Xu
cs.AI
Abstract
Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a particularly challenging testbed. However, existing LLM-based SC2 agents focus primarily on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2's hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We further develop a multi-dimensional offline evaluation framework for predicted structured observations. Offline results show that StarWM substantially outperforms zero-shot baselines, improving resource prediction accuracy by nearly 60% and markedly strengthening self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate-Simulate-Refine decision loop for foresight-driven policy refinement. Online evaluation against SC2's built-in AI demonstrates consistent improvements, with win-rate gains of 30%, 15%, and 30% against the Hard (LV5), Harder (LV6), and VeryHard (LV7) difficulties, respectively, alongside more stable macro-management and better tactical risk assessment.
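The Generate-Simulate-Refine loop described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the names `policy`, `world_model`, and `evaluator`, the candidate count, and the dictionary-style observations are all assumptions standing in for the paper's LLM policy, StarWM, and its evaluation of predicted structured observations.

```python
def generate_simulate_refine(observation, policy, world_model, evaluator,
                             n_candidates=3):
    """Hypothetical sketch: choose the candidate action whose world-model-
    predicted next observation scores best under an evaluator."""
    # Generate: the policy proposes several candidate actions.
    candidates = policy(observation, n_candidates)

    best_action, best_score = None, float("-inf")
    for action in candidates:
        # Simulate: the world model predicts the next observation
        # conditioned on the current observation and candidate action.
        predicted = world_model(observation, action)
        # Refine: keep the action whose predicted outcome scores highest.
        score = evaluator(predicted)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Toy usage with stand-in components (not SC2-specific):
policy = lambda obs, n: ["attack", "expand", "defend"][:n]
world_model = lambda obs, a: {"minerals": obs["minerals"] + (50 if a == "expand" else 0)}
evaluator = lambda pred: pred["minerals"]
chosen = generate_simulate_refine({"minerals": 100}, policy, world_model, evaluator)
```

In the paper's setting, the policy and world model are both LLM-based, and the "refine" step feeds the predicted observation back to the policy rather than applying a scalar evaluator; the scalar-scoring version here only illustrates the control flow.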