Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Abstract

While Reinforcement Learning (RL) has achieved remarkable success in language modeling, that success has not yet fully carried over to visuomotor agents. A primary challenge for RL models is their tendency to overfit to specific tasks or environments, which hinders the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address the challenges of multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support it. Experimental results show that RL increases interaction success rates by 4× and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for advancing the spatial reasoning of visuomotor agents.