PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

Abstract

We propose a Pose-Free Large Reconstruction Model (PF-LRM) that reconstructs a 3D object from a few unposed images, even when they have little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method that uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. When trained on a large amount of multi-view posed data covering ~1M objects, PF-LRM shows strong cross-dataset generalization and outperforms baseline methods by a large margin in both pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability to downstream text/image-to-3D tasks with fast feed-forward inference. Our project website is at: https://totoro97.github.io/pf-lrm .
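
As a rough illustration of the token exchange described in the abstract, the sketch below concatenates a few 3D object tokens with 2D image tokens and runs one round of self-attention over the joint sequence, so every token can attend to every other. It is not the PF-LRM implementation: the identity query/key/value projections, the single head, the toy token values, and all names (`self_attention`, `object_tokens`, `image_tokens`) are assumptions made for illustration. A real transformer block would use learned projections, multiple heads, normalization, and an MLP.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: identity projections keep the sketch short; a real
    block learns separate query, key, and value matrices.
    Returns the updated tokens and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        # Weighted sum of all tokens (values equal the tokens themselves here).
        out = [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        weights.append(row)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs: 2 object tokens and 3 image tokens of width 4.
object_tokens = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
image_tokens = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.5, 0.0], [0.5, 0.5, 0.0, 0.0]]

# Joint sequence: attention mixes information across both token types.
tokens = object_tokens + image_tokens
updated, attn = self_attention(tokens)

# Split the sequence back into its two groups after the exchange.
updated_object_tokens = updated[:len(object_tokens)]
updated_image_tokens = updated[len(object_tokens):]
```

Because the two token sets share one sequence, each row of `attn` spreads weight over both object and image tokens, which is the sense in which information is exchanged between them.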
