Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
July 31, 2025
Authors: Shaofei Cai, Zhancun Mu, Haiwen Xia, Bowei Zhang, Anji Liu, Yitao Liang
cs.AI
Abstract
While reinforcement learning (RL) has achieved remarkable success in language
modeling, that success has not yet fully carried over to visuomotor agents. A
primary challenge for RL models is their tendency to overfit to specific tasks or
environments, thereby hindering the acquisition of generalizable behaviors
across diverse settings. This paper provides a preliminary answer to this
challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can
achieve zero-shot generalization to unseen worlds. Specifically, we explore
RL's potential to enhance generalizable spatial reasoning and interaction
capabilities in 3D worlds. To address challenges in multi-task RL
representation, we analyze and establish cross-view goal specification as a
unified multi-task goal space for visuomotor policies. Furthermore, to overcome
the significant bottleneck of manual task design, we propose automated task
synthesis within the highly customizable Minecraft environment for large-scale
multi-task RL training, and we construct an efficient distributed RL framework
to support this. Experimental results show that RL boosts interaction
success rates by 4× and enables zero-shot generalization of spatial
reasoning across diverse environments, including real-world settings. Our
findings underscore the immense potential of RL training in 3D simulated
environments, especially those amenable to large-scale task generation, for
significantly advancing visuomotor agents' spatial reasoning.