

Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

October 10, 2025
作者: Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
cs.AI

Abstract

Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance on long-horizon, interactive tasks such as web navigation and computer/phone use. Inspired by the literature on human cognition, we argue that current AI agents need "vicarious trial and error" - the capacity to mentally simulate alternative futures before acting - to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method that further strengthens the agent's simulation and decision-making abilities by using both outcome rewards and intermediate states from real rollouts as feedback. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome- and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
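The core idea the abstract describes - mentally simulating alternative futures before committing to an action - can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `simulate` world model, the `score` function, and the toy integer-position environment are all hypothetical stand-ins for the learned dynamics and reward signals the paper actually uses.

```python
def simulate(state, action):
    """Toy deterministic world model: state is an integer position,
    actions step left/right. A hypothetical stand-in for learned dynamics."""
    return state + {"left": -1, "right": +1}[action]

def score(state, goal):
    """Value of an imagined outcome: negative distance to the goal."""
    return -abs(goal - state)

def act_with_simulation(state, goal, actions=("left", "right")):
    """Vicarious trial and error: roll out each candidate action in the
    world model, then commit to the one whose imagined outcome scores best."""
    imagined = {a: simulate(state, a) for a in actions}
    return max(actions, key=lambda a: score(imagined[a], goal))

if __name__ == "__main__":
    print(act_with_simulation(state=0, goal=3))  # "right": steps toward goal
```

The agent never acts in the real environment during this deliberation; only the winning action is executed. The paper's framework deepens this one-step lookahead into expanded search trees (ReSim) and trains the simulation itself with reinforcement learning (Dyna-GRPO).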