Point Transformer V3: Simpler, Faster, Stronger

Abstract

This paper does not seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency in point cloud processing by leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is influenced more by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that have only a minor effect on overall performance after scaling. For example, it replaces precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized in specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
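
The idea of a serialized neighbor mapping can be illustrated with a small sketch. The abstract does not say which pattern is used to organize the points, so the code below assumes a z-order (Morton) curve purely for illustration. The function names, the `grid_size` parameter, and the patch grouping are hypothetical and are not taken from the PTv3 implementation. Points are snapped to a voxel grid, sorted by the Morton code of their voxel, and then split into consecutive fixed-size groups that stand in for KNN neighborhoods.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of three grid coordinates (z-order)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize(points: Sequence[Point], grid_size: float, bits: int = 10) -> List[int]:
    """Return point indices sorted by the Morton code of their voxel.

    Assumes a non-empty point list; `grid_size` is a hypothetical voxel size.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    limit = (1 << bits) - 1  # clamp so coordinates fit in `bits` bits

    def key(i: int) -> int:
        cell = [min(int((points[i][d] - mins[d]) / grid_size), limit) for d in range(3)]
        return morton_code(cell[0], cell[1], cell[2], bits)

    # sorted() is stable, so points in the same voxel keep their input order.
    return sorted(range(len(points)), key=key)


def patches(order: Sequence[int], patch_size: int) -> List[List[int]]:
    """Split the serialized order into consecutive fixed-size groups."""
    return [list(order[i:i + patch_size]) for i in range(0, len(order), patch_size)]


# Two clusters of two points each; nearby points end up adjacent in the order.
demo = [(0.0, 0.0, 0.0), (3.0, 3.0, 3.0), (0.1, 0.0, 0.0), (3.0, 3.0, 3.1)]
demo_order = serialize(demo, grid_size=1.0)
print(patches(demo_order, patch_size=2))  # [[0, 2], [1, 3]]
```

In this sketch each group is found with one sort and a slice instead of a per-point nearest-neighbor query. That is the kind of accuracy-for-efficiency trade the abstract describes: points that are adjacent in the sorted order are usually, but not always, close in space.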
