PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

Abstract

We propose a Pose-Free Large Reconstruction Model (PF-LRM) that reconstructs a 3D object from a few unposed images, even with little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method that uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens. We predict a coarse point cloud for each view and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. Trained on a large amount of multi-view posed data from ~1M objects, PF-LRM shows strong cross-dataset generalization and outperforms baseline methods by a large margin in both pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate the model's applicability to downstream text/image-to-3D tasks with fast feed-forward inference. Project website: https://totoro97.github.io/pf-lrm
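
To make the pose-recovery step more concrete, the following is a minimal sketch of how a camera pose can be obtained from 2D-3D correspondences, such as a per-view point cloud in which each predicted 3D point is tied to a pixel. It is not the solver used by PF-LRM: the abstract specifies a differentiable PnP solver, whereas this sketch swaps in the classical, non-differentiable Direct Linear Transform (DLT) for brevity. The function name `pnp_dlt`, the intrinsics `K`, and the synthetic points and pose are all assumptions made for illustration.

```python
import numpy as np


def pnp_dlt(points_3d, pixels, K):
    """Estimate (R, t) with x_cam = R @ X + t from 2D-3D matches via DLT.

    points_3d: (N, 3) points in the reference frame, N >= 6, non-coplanar.
    pixels:    (N, 2) pixel coordinates of the same points in one view.
    K:         (3, 3) camera intrinsics (assumed known in this sketch).
    """
    n = points_3d.shape[0]
    # Normalized image coordinates: K^-1 [u, v, 1]^T.
    pix_h = np.hstack([pixels, np.ones((n, 1))])
    xy = (np.linalg.inv(K) @ pix_h.T).T
    X_h = np.hstack([points_3d, np.ones((n, 1))])

    # Each match gives two linear equations in the 12 entries of P = [R | t].
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X_h
    A[0::2, 8:12] = -xy[:, 0:1] * X_h
    A[1::2, 4:8] = X_h
    A[1::2, 8:12] = -xy[:, 1:2] * X_h

    # The solution is the right singular vector of the smallest singular value.
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)

    # Project the left 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:  # resolve the sign ambiguity of the null vector
        R, t = -R, -t
    return R, t


rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])

# Hypothetical ground-truth pose: rotation about the y-axis plus a translation.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.1, -0.2, 4.0])

# Synthetic stand-in for a predicted coarse point cloud, one pixel per point.
points = rng.uniform(-1.0, 1.0, size=(32, 3))
cam = points @ R_true.T + t_true
pixels = (cam @ K.T)[:, :2] / cam[:, 2:3]

R_est, t_est = pnp_dlt(points, pixels, K)
print("rotation error:", np.abs(R_est - R_true).max())
print("translation error:", np.abs(t_est - t_true).max())
```

On noise-free synthetic data the linear solve should recover the pose up to floating-point error. With noisy predicted points, a robust or weighted solver would be the more realistic choice, and a differentiable formulation is what allows the pose loss to be backpropagated into the network during training.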
