One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Abstract

Symbolic world modeling requires inferring and representing an environment's transition dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting: learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally activated programmatic laws within a probabilistic programming framework. Each law has a precondition-effect structure and activates only in the world states relevant to it. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, which avoids the scaling problems that arise when every law contributes to predictions about a complex, hierarchical state, and enables learning stochastic dynamics even when rule activations are sparse. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife successfully learns key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 of the 23 scenarios tested. We also test OneLife's planning ability, and find that simulated rollouts successfully identify superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
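The precondition-effect structure and the state-ranking measure described above can be illustrated with a minimal Python sketch. It is not the OneLife implementation. The names `Law`, `active_laws`, `predict`, and `likelihood`, the flat dictionary state, the two example laws, and their probabilities are all hypothetical, and combining active laws as a weighted mixture is an assumption made for brevity rather than the paper's inference procedure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical flat symbolic state; Crafter-OO's real state is object-oriented.
State = Dict[str, int]
# Per-variable distribution over next values: {variable: {value: probability}}.
Prediction = Dict[str, Dict[int, float]]


@dataclass
class Law:
    name: str
    precondition: Callable[[State, str], bool]  # is the law relevant here?
    effect: Callable[[State, str], Prediction]  # what it predicts when active
    weight: float = 1.0                         # would be fitted from data


def active_laws(laws: List[Law], state: State, action: str) -> List[Law]:
    """Route prediction only through laws whose precondition holds."""
    return [law for law in laws if law.precondition(state, action)]


def predict(laws: List[Law], state: State, action: str) -> Prediction:
    """Weighted mixture of active laws (an assumption of this sketch).

    Variables that no active law mentions are predicted to stay unchanged.
    """
    mixed: Prediction = {}
    totals: Dict[str, float] = {}
    for law in active_laws(laws, state, action):
        for var, dist in law.effect(state, action).items():
            slot = mixed.setdefault(var, {})
            for value, p in dist.items():
                slot[value] = slot.get(value, 0.0) + law.weight * p
            totals[var] = totals.get(var, 0.0) + law.weight
    pred: Prediction = {v: {state[v]: 1.0} for v in state if v not in mixed}
    for var, slot in mixed.items():
        pred[var] = {value: p / totals[var] for value, p in slot.items()}
    return pred


def likelihood(pred: Prediction, next_state: State) -> float:
    """Probability of a candidate next state, treating variables independently."""
    prob = 1.0
    for var, value in next_state.items():
        prob *= pred[var].get(value, 0.0)
    return prob


# Two illustrative laws with made-up mechanics and probabilities.
LAWS = [
    Law(
        name="zombie_attack",
        precondition=lambda s, a: s["zombie_adjacent"] == 1,
        effect=lambda s, a: {"health": {s["health"] - 2: 0.5, s["health"]: 0.5}},
    ),
    Law(
        name="eat",
        precondition=lambda s, a: a == "eat" and s["food"] < 9,
        effect=lambda s, a: {"food": {s["food"] + 1: 1.0}},
    ),
]

state = {"health": 5, "food": 3, "zombie_adjacent": 1}
pred = predict(LAWS, state, "eat")

# State ranking: score the observed next state against an implausible distractor.
observed = {"health": 3, "food": 4, "zombie_adjacent": 1}
distractor = {"health": 5, "food": 9, "zombie_adjacent": 1}
ranked = sorted([distractor, observed], key=lambda s: likelihood(pred, s), reverse=True)
```

In this sketch a law whose precondition fails contributes nothing to the prediction, so its weight would receive no learning signal from that transition. This is the sense in which computation is routed only through relevant laws. The final lines mirror state ranking: the model should place the observed next state above a distractor that violates the learned mechanics.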
