Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

October 10, 2025
Authors: Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
cs.AI

Abstract

Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone use. Inspired by the literature on human cognition, we argue that current AI agents need "vicarious trial and error", the capacity to mentally simulate alternative futures before acting, in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method that further strengthens the agent's simulation and decision-making ability by using both outcome rewards and intermediate states from real rollouts as feedback. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome- and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
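To make the stage-1 recipe concrete, here is a minimal sketch of how ReSim-style training traces might be constructed from real interactions, assuming a generic environment that supports snapshotting; `SearchNode`, `expand_tree`, `linearize`, and the `env.snapshot()`/`env.step()` interface are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SearchNode:
    """One node in a search tree built from real environment interactions."""
    state: str                      # textual observation of the environment
    action: Optional[str] = None    # action that led into this state (None at root)
    reward: float = 0.0             # reward observed on entering this state
    children: List["SearchNode"] = field(default_factory=list)

def expand_tree(env, state: str, actions: List[str], depth: int) -> SearchNode:
    """Expand a search tree by executing actions in snapshots of the real
    environment, so every branch reflects true dynamics rather than a learned
    world model. Assumes (hypothetically) that `env.snapshot()` returns an
    independent copy whose `step(a)` yields (next_state, reward, done, info)."""
    node = SearchNode(state=state)
    if depth == 0:
        return node
    for a in actions:
        branch = env.snapshot()
        next_state, reward, _done, _info = branch.step(a)
        child = expand_tree(branch, next_state, actions, depth - 1)
        child.action, child.reward = a, reward
        node.children.append(child)
    return node

def linearize(node: SearchNode, depth: int = 0) -> List[str]:
    """Flatten the tree into a structured 'simulation' trace; training the
    agent to emit traces like this before acting is the gist of stage 1."""
    lines = []
    for child in node.children:
        lines.append("  " * depth +
                     f"if I take '{child.action}': {child.state} (r={child.reward})")
        lines.extend(linearize(child, depth + 1))
    return lines
```

Traces of this form would then supervise the (V)LM to reason through anticipated futures before committing to an action; the tree depth and candidate action set would presumably be task-specific.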
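Likewise, a rough sketch of the stage-2 idea, under the assumption that Dyna-GRPO keeps GRPO's group-relative advantage normalization and folds in a per-rollout score derived from intermediate states; the linear blend and the `alpha` weight below are hypothetical stand-ins for the paper's actual formulation.

```python
import numpy as np

def dyna_grpo_advantages(outcome_rewards, state_scores, alpha=0.5):
    """Group-relative advantages in the spirit of GRPO: each rollout's return
    is normalized against the group of rollouts sampled for the same task.
    Blending the final outcome reward with a score computed from intermediate
    states stands in for Dyna-GRPO's use of real-rollout states as feedback."""
    r = np.asarray(outcome_rewards, dtype=float)   # one scalar per rollout
    s = np.asarray(state_scores, dtype=float)      # e.g., fraction of subgoals reached
    blended = (1.0 - alpha) * r + alpha * s
    return (blended - blended.mean()) / (blended.std() + 1e-8)

# Example: 4 rollouts of one task; only rollout 2 succeeds outright, but
# rollouts 0 and 3 make partial progress visible in their intermediate states.
adv = dyna_grpo_advantages([0, 0, 1, 0], [0.5, 0.0, 1.0, 0.25])
print(adv)  # rollouts whose states show more progress receive higher advantage
```

The point of the intermediate-state term is that two failed rollouts with identical outcome reward can still be ranked, which matters for the long-horizon tasks the paper targets.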