Reward Prediction with Factorized World States
March 10, 2026
Authors: Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao, Kai Zhang, Pascale Fung
cs.AI
Abstract
Agents must infer the outcomes of their actions and select those that maximize a reward signal indicating how close the goal is to being reached. Reward models trained with supervised learning can inherit biases from their training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To this end, we introduce StateFactory, a factorized representation method that uses language models to transform unstructured observations into a hierarchical object-attribute structure. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraints. Overall, the compact representation structure induced by StateFactory enables strong reward generalization. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. In the zero-shot setting, our method achieves 60% and 8% lower EPIC distance than the VLWM-critic and LLM-as-a-Judge reward models, respectively. Furthermore, this superior reward quality translates into improved agent planning: success-rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies, along with enhanced system-2 agent planning. Project Page: https://statefactory.github.io
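To make the core idea concrete, here is a minimal conceptual sketch (not the paper's implementation): a factorized world state represented as a nested object-to-attribute mapping, with the reward estimated as the average similarity between current and goal attribute values. All names (`reward`, `attr_similarity`, the `apple` example) are hypothetical, and a string-overlap ratio stands in for the semantic similarity a language or embedding model would provide.

```python
# Conceptual sketch of a factorized-state reward, NOT the StateFactory
# implementation. States are nested dicts: {object: {attribute: value}}.
from difflib import SequenceMatcher


def attr_similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity between two attribute values."""
    return SequenceMatcher(None, a, b).ratio()


def reward(current: dict, goal: dict) -> float:
    """Average, over the goal's object-attribute pairs, of the similarity
    between the current value and the goal value. Objects or attributes
    missing from the current state score zero, reflecting the hierarchical
    constraint: attributes only count once their object is present."""
    scores = []
    for obj, goal_attrs in goal.items():
        cur_attrs = current.get(obj)
        for attr, goal_val in goal_attrs.items():
            if cur_attrs is None or attr not in cur_attrs:
                scores.append(0.0)
            else:
                scores.append(attr_similarity(cur_attrs[attr], goal_val))
    return sum(scores) / len(scores) if scores else 0.0


goal = {"apple": {"location": "fridge", "state": "clean"}}
s0 = {"apple": {"location": "countertop", "state": "dirty"}}
s1 = {"apple": {"location": "fridge", "state": "clean"}}
print(reward(s1, goal))  # → 1.0 (goal state reached)
print(reward(s0, goal) < reward(s1, goal))  # intermediate state scores lower
```

The design choice this illustrates is that once observations are factorized into object-attribute structure, the reward needs no learned head at all: it falls out of a similarity measure between the current and goal states, which is what makes the estimate goal-agnostic and transferable across domains.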