RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
March 31, 2025
Authors: Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen
cs.AI
Abstract
Reasoning before action and imagining potential outcomes (i.e., world models)
are essential for embodied agents operating in complex open-world environments.
Yet, prior work either incorporates only one of these abilities in an
end-to-end agent or integrates multiple specialized models into an agent
system, limiting the learning efficiency and generalization of the policy.
Thus, this paper makes the first attempt to synergize Reasoning and Imagination
in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end
manner, we construct a data pipeline that progressively integrates and enriches
the content of imagination and reasoning in the trajectories collected from
existing agents. The joint learning of reasoning and next-image generation
explicitly models the inherent correlation between reasoning, action, and
environment dynamics, and thus yields more than 17× better sample
efficiency, as well as improved generalization, compared with previous works.
During inference, RIG first reasons about the next action, produces a potential
action, and then predicts the action's outcome, which offers the agent a chance
to review and self-correct based on the imagination before taking real actions.
Experimental results show that the synergy of reasoning and imagination not
only improves the robustness, generalization, and interoperability of the
generalist policy but also enables test-time scaling to enhance overall
performance.
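
The inference procedure described in the abstract (reason, propose an action, imagine its outcome, then review and self-correct) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `rig_style_step`, the `Step` record, the three callables, and the `max_revisions` parameter are all hypothetical names introduced here, and strings stand in for real observations and predicted frames. In RIG a single end-to-end policy produces reasoning, action, and imagined frame; the sketch splits them into separate callables only for readability.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    reasoning: str  # textual reasoning produced before acting
    action: str     # candidate action proposed from that reasoning
    imagined: str   # stand-in for the predicted next frame


def rig_style_step(
    observation: str,
    reason_and_act: Callable[[str, List[Step]], Tuple[str, str]],
    imagine: Callable[[str, str], str],
    accept: Callable[[str, str], bool],
    max_revisions: int = 3,
) -> Tuple[str, List[Step]]:
    """Reason, propose an action, imagine its outcome, and revise if rejected.

    All callables are hypothetical placeholders, not part of any released API.
    Returns the chosen action and every draft that was considered.
    """
    if max_revisions < 1:
        raise ValueError("max_revisions must be at least 1")
    drafts: List[Step] = []
    for _ in range(max_revisions):
        # Earlier drafts are passed back in so the policy can self-correct.
        reasoning, action = reason_and_act(observation, drafts)
        imagined = imagine(observation, action)
        drafts.append(Step(reasoning, action, imagined))
        if accept(observation, imagined):
            break
    # If no draft was accepted, fall back to the most recent one.
    return drafts[-1].action, drafts


# Toy stand-ins so the sketch runs end to end.
def toy_reason_and_act(obs: str, drafts: List[Step]) -> Tuple[str, str]:
    tried = {d.action for d in drafts}
    for candidate in ("walk_forward", "turn_left"):
        if candidate not in tried:
            return f"try {candidate}", candidate
    return "no untried action left", "wait"


def toy_imagine(obs: str, action: str) -> str:
    return "falls into lava" if action == "walk_forward" else "safe ground"


def toy_accept(obs: str, imagined: str) -> bool:
    return "lava" not in imagined


action, drafts = rig_style_step(
    "lava ahead", toy_reason_and_act, toy_imagine, toy_accept
)
print(action, len(drafts))
```

In this toy run the first proposal is rejected after its imagined outcome is reviewed, and the second is taken. Raising `max_revisions` spends more inference compute per real action; whether this corresponds to how the paper performs test-time scaling is an assumption of the sketch.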