FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Abstract

We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking that supports both model-based and model-free setups. Our approach can be applied instantly at test time to a novel object without fine-tuning, as long as its CAD model is given or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it achieves results comparable to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/
