Learning POMDP World Models from Observations with Language-Model Priors

Abstract

Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP-inductor): an LLM proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and a step toward generalist agents in real-world environments. Code is available at https://github.com/atomresearch/pinductor.
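
The abstract does not spell out the exact form of the belief-based likelihood score. As a minimal sketch, assuming a tabular candidate model with discrete states, actions, and observations, one standard way to score a candidate POMDP against an observation-action trajectory is to run a belief filter and accumulate the log-probability of each observation. The function name `belief_log_likelihood` and the array layouts `T[a, s, s']` and `O[a, s', o]` are hypothetical choices made for illustration; they are not taken from the Pinductor code.

```python
import numpy as np


def belief_log_likelihood(T, O, b0, actions, observations):
    """Log-likelihood of an observation sequence under a candidate tabular POMDP.

    Assumed (hypothetical) layouts:
      T[a, s, s2] = P(s2 | s, a)   transition model
      O[a, s2, o] = P(o | s2, a)   observation model
      b0[s]       = P(s)           initial belief
    """
    T = np.asarray(T, dtype=float)
    O = np.asarray(O, dtype=float)
    belief = np.asarray(b0, dtype=float)
    log_lik = 0.0
    for a, o in zip(actions, observations):
        predicted = belief @ T[a]        # P(s2 | history, a)
        joint = predicted * O[a][:, o]   # P(s2, o | history, a)
        evidence = joint.sum()           # P(o | history, a)
        if evidence == 0.0:
            # The candidate model assigns zero probability to this trajectory.
            return -np.inf
        log_lik += np.log(evidence)
        belief = joint / evidence        # Bayes update of the belief
    return log_lik
```

Under these assumptions, candidate models could be ranked by summing this score over the available trajectories, with a higher total indicating a model that explains the observations better. No access to the hidden state is needed, since the score depends only on actions and observations.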