RVT: Robotic View Transformer for 3D Object Manipulation
June 26, 2023
Authors: Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox
cs.AI
Abstract
For 3D object manipulation, methods that build an explicit 3D representation
perform better than those relying only on camera images. But using explicit 3D
representations like voxels comes at a large computing cost, adversely affecting
scalability. In this work, we propose RVT, a multi-view transformer for 3D
manipulation that is both scalable and accurate. Some key features of RVT are
an attention mechanism to aggregate information across views and re-rendering
of the camera input from virtual views around the robot workspace. In
simulations, we find that a single RVT model works well across 18 RLBench tasks
with 249 task variations, achieving 26% higher relative success than the
existing state-of-the-art method (PerAct). It also trains 36X faster than
PerAct to achieve the same performance, and achieves 2.3X the inference speed
of PerAct. Further, RVT can perform a variety of manipulation tasks in the real
world with just a few (∼10) demonstrations per task. Visual results, code,
and trained model are provided at https://robotic-view-transformer.github.io/.
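The two key ideas in the abstract (re-rendering the sensed point cloud from virtual views around the workspace, then aggregating information across those views with attention) can be illustrated with a minimal sketch. The function names, the orthographic nearest-point rasterizer, and the single-head attention are assumptions chosen for clarity, not the paper's actual implementation:

```python
import numpy as np

def render_virtual_view(points, feats, R, img_size=8, workspace=1.0):
    """Orthographically project a point cloud onto a virtual camera plane.
    R is a 3x3 rotation defining the virtual view's orientation (an
    illustrative stand-in for RVT's virtual-view re-rendering)."""
    cam = points @ R.T  # rotate points into the virtual view's frame
    # Map x, y in [-workspace, workspace] to integer pixel coordinates.
    uv = ((cam[:, :2] / workspace + 1) / 2 * (img_size - 1)).round().astype(int)
    uv = np.clip(uv, 0, img_size - 1)
    img = np.zeros((img_size, img_size, feats.shape[1]))
    depth = np.full((img_size, img_size), -np.inf)
    for (u, v), z, f in zip(uv, cam[:, 2], feats):
        if z > depth[v, u]:  # keep only the nearest point per pixel
            depth[v, u] = z
            img[v, u] = f
    return img

def cross_view_attention(view_tokens):
    """Single-head scaled dot-product attention over the concatenated
    tokens of all views, so each token can aggregate information from
    every other view (the cross-view aggregation idea, simplified)."""
    x = np.concatenate(view_tokens, axis=0)            # (n_tokens, d)
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # softmax rows
    return w @ x                                       # attended tokens

# Usage: render a point cloud from two virtual views and fuse them.
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(100, 3))
feats = rng.uniform(0, 1, size=(100, 4))               # e.g. RGB + depth
top = render_virtual_view(points, feats, np.eye(3))
# 90-degree rotation about x: a second virtual view of the same cloud.
Rx = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=float)
front = render_virtual_view(points, feats, Rx)
fused = cross_view_attention([top.reshape(-1, 4), front.reshape(-1, 4)])
```

In this sketch each virtual image is flattened into per-pixel tokens; the full model would add learned projections, positional encodings, and many attention layers, but the data flow (point cloud → virtual renders → joint attention) mirrors what the abstract describes.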