3D-LFM: Lifting Foundation Model

Abstract

Lifting 3D structure and camera from 2D landmarks is a cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data, which significantly limits their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstand occlusions, and generalize to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.
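The permutation-equivariance property mentioned above can be illustrated with a small numerical check. The sketch below is not the 3D-LFM architecture. It is a single-head self-attention layer written in NumPy, and the function name `self_attention`, the weight matrices, and the mask layout are all hypothetical choices made for illustration. It shows two things under those assumptions: reordering the input points reorders the outputs in the same way, and padded or occluded points can be masked out so that instances with different numbers of points share one fixed-size input.

```python
import numpy as np

def self_attention(x, mask, wq, wk, wv):
    """Single-head self-attention over N points with a validity mask.

    x:    (N, d_in) per-point features, e.g. 2D landmark coordinates.
    mask: (N,) bool, True for real points, False for padded/occluded ones.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Masked keys get -inf so they receive zero attention weight.
    scores = np.where(mask[None, :], scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_points, d_in, d_model = 6, 2, 8  # hypothetical sizes
x = rng.normal(size=(n_points, d_in))
mask = np.array([True, True, True, True, False, False])  # last two are padding
wq, wk, wv = (rng.normal(size=(d_in, d_model)) for _ in range(3))

out = self_attention(x, mask, wq, wk, wv)

# Equivariance: permuting the points permutes the outputs identically.
perm = rng.permutation(n_points)
out_perm = self_attention(x[perm], mask[perm], wq, wk, wv)
equivariant = np.allclose(out_perm, out[perm])

# Padding invariance: values stored in masked slots do not affect valid points.
x_changed = x.copy()
x_changed[~mask] = 123.0
out_changed = self_attention(x_changed, mask, wq, wk, wv)
padding_ignored = np.allclose(out_changed[mask], out[mask])

print(equivariant, padding_ignored)
```

Because no step depends on the position of a point in the array, no fixed landmark ordering or cross-instance correspondence is baked into the layer, which is the property the abstract relies on.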
