Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
July 22, 2024
Authors: Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu
cs.AI
Abstract
Can we endow visuomotor robots with generalization capabilities to operate in
diverse open-world scenarios? In this paper, we propose Maniwhere, a
generalizable framework tailored for visual reinforcement learning, enabling
the trained robot policies to generalize across a combination of multiple
visual disturbance types. Specifically, we introduce a multi-view
representation learning approach fused with a Spatial Transformer Network (STN)
module to capture shared semantic information and correspondences among
different viewpoints. In addition, we employ a curriculum-based randomization
and augmentation approach to stabilize the RL training process and strengthen
the visual generalization ability. To exhibit the effectiveness of Maniwhere,
we meticulously design 8 tasks encompassing articulated-object, bi-manual, and
dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual
generalization and sim2real transfer abilities across 3 hardware platforms. Our
experiments show that Maniwhere significantly outperforms existing
state-of-the-art methods. Videos are provided at
https://gemcollector.github.io/maniwhere/.
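
To make the multi-view representation learning idea concrete, below is a minimal PyTorch sketch of an encoder with an STN front-end and a cross-view feature-alignment loss. The class name, network sizes, and cosine-distance alignment objective are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: multi-view encoder with an STN module (not the official Maniwhere code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Localization network predicts a 2x3 affine transform per image (the STN part).
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 6),
        )
        # Initialize the transform to identity so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Shared convolutional encoder applied to the spatially transformed image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        x_aligned = F.grid_sample(x, grid, align_corners=False)
        return self.encoder(x_aligned)


def multiview_alignment_loss(enc: MultiViewEncoder,
                             view_a: torch.Tensor,
                             view_b: torch.Tensor) -> torch.Tensor:
    """Pull features of the same scene captured from two camera views together."""
    z_a = F.normalize(enc(view_a), dim=-1)
    z_b = F.normalize(enc(view_b), dim=-1)
    return (1.0 - (z_a * z_b).sum(dim=-1)).mean()  # cosine-distance alignment


if __name__ == "__main__":
    enc = MultiViewEncoder()
    a, b = torch.rand(4, 3, 84, 84), torch.rand(4, 3, 84, 84)
    print(multiview_alignment_loss(enc, a, b).item())
```

In this sketch the STN learns to warp each camera view before encoding, which is one plausible way to capture correspondences across viewpoints; the exact objective used to enforce shared semantics in the paper may differ.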
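The curriculum-based randomization can likewise be sketched as a scheduler that widens domain-randomization and augmentation ranges as training progresses. The linear warmup, step budget, and perturbation names below are illustrative assumptions rather than the schedule reported in the paper.

```python
# Hypothetical sketch of curriculum-scheduled domain randomization (illustrative values only).
import random


class CurriculumRandomizer:
    """Scales randomization/augmentation magnitude from weak to strong over training."""

    def __init__(self, warmup_steps: int = 100_000, max_strength: float = 1.0):
        self.warmup_steps = warmup_steps
        self.max_strength = max_strength

    def strength(self, step: int) -> float:
        # Linearly increase difficulty until the warmup budget is exhausted.
        return self.max_strength * min(1.0, step / self.warmup_steps)

    def sample_randomization(self, step: int) -> dict:
        s = self.strength(step)
        # Perturbation ranges widen as the curriculum progresses.
        return {
            "light_intensity_jitter": random.uniform(-0.5, 0.5) * s,
            "camera_angle_jitter_deg": random.uniform(-10.0, 10.0) * s,
            "texture_swap_prob": 0.8 * s,
        }


if __name__ == "__main__":
    sched = CurriculumRandomizer()
    for step in (0, 50_000, 200_000):
        print(step, sched.sample_randomization(step))
```

Starting with near-zero perturbations and ramping them up is a common way to keep early RL training stable while still exposing the policy to strong visual disturbances later, which matches the stabilization role the abstract ascribes to the curriculum.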