Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Abstract

While reinforcement learning (RL) has achieved remarkable success in language modeling, that success has not yet fully carried over to visuomotor agents. A primary challenge is that RL models tend to overfit to specific tasks or environments, which hinders the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment, and we construct an efficient distributed RL framework to support large-scale multi-task RL training. Experimental results show that RL boosts interaction success rates by 4× and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.