3D-LFM: Lifting Foundation Model

Abstract

Lifting 3D structure and camera from 2D landmarks is a cornerstone of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g., C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All of these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data, which significantly limits their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstand occlusions, and generalize to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Because our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.
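
The permutation-equivariance property that the abstract relies on can be illustrated with a small numerical sketch. The code below is not the 3D-LFM architecture; it is a minimal single-head self-attention layer written in NumPy, under the assumption that no positional encoding is applied. The function name `self_attention`, the weight matrices, and the `visible` mask are hypothetical names introduced only for this illustration. It checks two things: reordering the input landmarks reorders the output in the same way, and padded or occluded points that are masked out do not change the result for the remaining points.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v, visible=None):
    """Single-head self-attention over a set of landmarks (no positional encoding).

    x:       (n, d_in) array of 2D landmarks.
    visible: optional boolean (n,) mask; False marks an occluded or padded point.
             At least one point must be visible.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if visible is not None:
        # Masked points are never attended to: their keys get zero weight.
        scores = np.where(visible[None, :], scores, -np.inf)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
d_in, d_model = 2, 8  # illustrative sizes, not taken from the paper
w_q, w_k, w_v = (rng.normal(size=(d_in, d_model)) for _ in range(3))

landmarks = rng.normal(size=(5, d_in))
perm = rng.permutation(5)

# Equivariance: permuting the input rows permutes the output rows identically.
out = self_attention(landmarks, w_q, w_k, w_v)
out_perm = self_attention(landmarks[perm], w_q, w_k, w_v)
equivariant = np.allclose(out[perm], out_perm)

# Varying point counts: pad to a fixed length and mask the padding.
padded = np.vstack([landmarks, np.zeros((3, d_in))])
visible = np.array([True] * 5 + [False] * 3)
out_padded = self_attention(padded, w_q, w_k, w_v, visible)
padding_ignored = np.allclose(out_padded[:5], out)

print(equivariant, padding_ignored)
```

Because the layer treats its input as an unordered set, no fixed landmark ordering has to be shared across object categories, which is the kind of property that allows one model to handle instances with different numbers of points.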
