Learning POMDP World Models from Observations with Language-Model Priors

Abstract

Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP-inductor), a framework in which a large language model (LLM) proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and as a step toward generalist agents in real-world environments. Code is available at https://github.com/atomresearch/pinductor.
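
To make the idea of a belief-based likelihood score concrete, the following is a minimal sketch, assuming a small tabular POMDP with discrete states, actions, and observations. The abstract does not specify the exact scoring function used by Pinductor; this sketch uses the standard belief-filter (forward) likelihood of the observations given the actions, and every name in it (`belief_log_likelihood`, `T`, `O`, `b0`, the two toy candidates) is hypothetical and chosen only for illustration.

```python
import math


def belief_log_likelihood(b0, T, O, actions, observations):
    """Log-likelihood of an observation sequence under a candidate POMDP.

    Assumed (hypothetical) tabular layout:
      b0[s]       : initial belief over hidden states
      T[a][s][s2] : P(s2 | s, a), transition model
      O[a][s2][o] : P(o | s2, a), observation model
    """
    belief = list(b0)
    n = len(belief)
    total = 0.0
    for a, o in zip(actions, observations):
        # Prediction step: push the current belief through the transition model.
        predicted = [
            sum(belief[s] * T[a][s][s2] for s in range(n)) for s2 in range(n)
        ]
        # Weight each predicted state by how well it explains the observation.
        joint = [predicted[s2] * O[a][s2][o] for s2 in range(n)]
        p_obs = sum(joint)  # P(o_t | earlier observations, actions)
        if p_obs <= 0.0:
            return float("-inf")  # the candidate rules out this trajectory
        total += math.log(p_obs)
        # Correction step: normalize to obtain the updated belief.
        belief = [j / p_obs for j in joint]
    return total


# Two toy candidate models for a 2-state world with a single "stay" action (0).
b0 = [0.5, 0.5]
T_stay = {0: [[1.0, 0.0], [0.0, 1.0]]}
O_sharp = {0: [[0.9, 0.1], [0.1, 0.9]]}  # observations are informative
O_flat = {0: [[0.5, 0.5], [0.5, 0.5]]}   # observations carry no information

actions = [0, 0, 0]
observations = [0, 0, 0]

score_sharp = belief_log_likelihood(b0, T_stay, O_sharp, actions, observations)
score_flat = belief_log_likelihood(b0, T_stay, O_flat, actions, observations)

# A proposer loop would keep the higher-scoring candidate and refine it further.
best = "sharp" if score_sharp > score_flat else "flat"
```

Note that the score requires only observations and actions, never the hidden states themselves, which is the sense in which a method of this kind can work without privileged state access. In a propose-and-refine loop, a score like this could be used to rank candidate models, although the actual refinement procedure is not described in the abstract.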