

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

October 9, 2024
作者: Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang
cs.AI

Abstract

Can large language models (LLMs) directly serve as powerful world models for model-based agents? While the gaps between the prior knowledge of LLMs and the specified environment's dynamics do exist, our study reveals that the gaps can be bridged by aligning an LLM with its deployed environment and such "world alignment" can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient-free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent-explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC). By optimizing look-ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL-E's reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL-E exceeds baselines by 15-30% in success rate while costing 8-20 fewer replanning rounds and only 60-80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% only after 6 iterations.
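The abstract's core mechanism — inducing a corrective rule whenever the world model's prediction disagrees with an observed transition, updating rule statistics, and pruning unreliable rules — can be sketched in a few lines. This is a hypothetical, simplified illustration, not the paper's implementation: rules here are a plain `(state, action) -> next_state` mapping, and `llm_predict` stands in for the LLM world model's default prediction.

```python
def align_world_model(rules, trajectories, llm_predict):
    """One iteration of a (hypothetical) induce/update/prune loop.

    rules: dict mapping (state, action) -> predicted next state
    trajectories: iterable of (state, action, observed_next_state)
    llm_predict: fallback prediction from the LLM world model
    Returns the pruned rule set after comparing predictions to reality.
    """
    scores = {}  # rule key -> [confirmations, contradictions]
    for state, action, observed in trajectories:
        key = (state, action)
        predicted = rules.get(key, llm_predict(state, action))
        if key in rules:
            hits, misses = scores.get(key, [0, 0])
            if predicted == observed:
                scores[key] = [hits + 1, misses]     # update: rule confirmed
            else:
                scores[key] = [hits, misses + 1]     # update: rule contradicted
        if predicted != observed:
            # Induce: add a rule correcting the mismatched prediction.
            rules[key] = observed
    # Prune: keep rules confirmed at least as often as contradicted.
    return {k: v for k, v in rules.items()
            if scores.get(k, [1, 0])[0] >= scores.get(k, [1, 0])[1]}


# Toy usage: the LLM wrongly predicts chopping yields nothing; after one
# pass over explored trajectories, a corrective rule is learned.
rules = align_world_model(
    rules={},
    trajectories=[("forest", "chop", "has_wood"),
                  ("forest", "chop", "has_wood")],
    llm_predict=lambda state, action: "nothing",
)
```

The resulting world model is then the LLM plus this compact rule set: at planning time the agent consults the rules first and falls back to the LLM prior, which is what lets MPC optimize look-ahead actions against predictions aligned with the deployed environment.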

