RVT-2: Learning Precise Manipulation from Few Demonstrations

Abstract

In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks from few demonstrations and solving them precisely. Prior works such as PerAct and RVT have studied this problem, but they often struggle with tasks requiring high precision. We study how to make such systems more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state of the art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained models are provided at: https://robotic-view-transformer-2.github.io/
