

RVT: Robotic View Transformer for 3D Object Manipulation

June 26, 2023
作者: Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox
cs.AI

Abstract

For 3D object manipulation, methods that build an explicit 3D representation perform better than those relying only on camera images. But using explicit 3D representations like voxels comes at a large computational cost, adversely affecting scalability. In this work, we propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate. Key features of RVT are an attention mechanism to aggregate information across views and re-rendering of the camera input from virtual views around the robot workspace. In simulations, we find that a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct). It also trains 36X faster than PerAct while achieving the same performance, and attains 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (~10) demonstrations per task. Visual results, code, and trained model are provided at https://robotic-view-transformer.github.io/.
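To make the core idea concrete, the sketch below illustrates how attention can aggregate information across tokens from multiple re-rendered virtual views. This is a minimal, dependency-light illustration of the general cross-view attention pattern, not the paper's actual architecture: the single attention layer, identity Q/K/V projections, and the `cross_view_attention` function are assumptions made for clarity (a real model would use learned projection weights and many transformer layers).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(view_tokens):
    """Fuse features across virtual views with one scaled dot-product
    attention step (hypothetical simplification, not RVT's actual model).

    view_tokens: (num_views, tokens_per_view, dim) features extracted from
    images re-rendered at virtual viewpoints around the robot workspace.
    Returns an array of the same shape in which every token has attended
    to tokens from *all* views, so information mixes across viewpoints.
    """
    v, t, d = view_tokens.shape
    x = view_tokens.reshape(v * t, d)  # flatten all views into one token sequence
    # Identity Q/K/V projections keep the sketch self-contained; a trained
    # transformer would apply learned weight matrices here.
    attn = softmax(x @ x.T / np.sqrt(d), axis=-1)  # (v*t, v*t): cross-view weights
    out = attn @ x
    return out.reshape(v, t, d)

# Toy example: 5 virtual views, 4 tokens per view, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4, 8))
fused = cross_view_attention(feats)
assert fused.shape == (5, 4, 8)
```

Flattening the per-view token grids into a single sequence is what lets a plain attention layer exchange information between views; the per-view structure is recovered afterward by reshaping.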