A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

Abstract

Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics datasets is the key to success for PVMs. Motivated by this discovery, we design SlotMIM, a method that induces object-centric representations by combining a semantic bottleneck, which reduces the number of prototypes to encourage the emergence of objectness, with cross-view consistency regularization, which encourages multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
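The two ingredients named in the abstract can be illustrated with a toy example. The sketch below is not the SlotMIM implementation (see the repository linked above for that). It only assumes that a "semantic bottleneck" can be approximated by softly assigning patch features to a small set of prototypes, and that "cross-view consistency" can be approximated by a cross-entropy between the assignments of corresponding patches in two views. The function names `soft_assign` and `cross_view_consistency`, the temperature value, and the toy data are all hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def soft_assign(patches, prototypes, temperature=0.1):
    """Softly assign each patch feature to one of K prototypes.

    Hypothetical stand-in for a semantic bottleneck: keeping K small
    forces many patches to share a prototype.
    """
    protos = [_normalize(p) for p in prototypes]
    assignments = []
    for patch in patches:
        f = _normalize(patch)
        # Cosine similarity to every prototype, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(f, p)) / temperature for p in protos]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        assignments.append([e / total for e in exps])
    return assignments


def cross_view_consistency(assign_a, assign_b, eps=1e-8):
    """Mean cross-entropy between assignments of corresponding patches.

    View A is treated as the target. This is an assumed form of the
    regularizer, chosen for illustration only.
    """
    losses = [
        -sum(pa * math.log(pb + eps) for pa, pb in zip(a, b))
        for a, b in zip(assign_a, assign_b)
    ]
    return sum(losses) / len(losses)


# Toy data: K = 2 prototypes and two patches per view (all values made up).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[0.9, 0.1], [0.2, 0.8]]
view_b = [[0.8, 0.2], [0.1, 0.9]]  # same content, slightly perturbed
view_c = [[0.1, 0.9], [0.9, 0.1]]  # patch contents swapped

assign_a = soft_assign(view_a, prototypes)
loss_match = cross_view_consistency(assign_a, soft_assign(view_b, prototypes))
loss_mismatch = cross_view_consistency(assign_a, soft_assign(view_c, prototypes))
```

In this toy setting the loss is small when corresponding patches in both views map to the same prototype (`loss_match`) and large when they do not (`loss_mismatch`). That is the behavior a consistency regularizer of this kind would reward.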
