Point Transformer V3: Simpler, Faster, Stronger

Abstract

This paper does not set out to innovate within the attention mechanism. Instead, it focuses on overcoming the existing trade-off between accuracy and efficiency in point cloud processing by leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is influenced more by scale than by intricate design. We therefore present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that matter little to overall performance after scaling. For example, it replaces the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized in specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (3x faster processing and 10x better memory efficiency than its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks spanning both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
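
The idea of swapping a KNN search for a serialized neighbor mapping can be illustrated with a minimal sketch. The abstract does not say which ordering pattern is used, so this example assumes a Z-order (Morton) curve over a voxel grid. The names `morton_code_3d`, `serialize_into_patches`, `grid_size`, and `patch_size` are hypothetical and are not taken from the PTv3 code. Points are sorted once by their curve key, and each run of consecutive points then serves as a neighborhood, so no per-point nearest-neighbor query is needed.

```python
import numpy as np


def morton_code_3d(grid, bits=10):
    """Interleave the bits of integer (x, y, z) grid coordinates into a Z-order key."""
    grid = np.asarray(grid, dtype=np.int64)
    if grid.min() < 0 or grid.max() >= (1 << bits):
        raise ValueError("grid coordinates must lie in [0, 2**bits)")
    code = np.zeros(grid.shape[0], dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            # Take bit b of this axis and place it at position 3*b + axis.
            bit = (grid[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)
    return code


def serialize_into_patches(points, grid_size=0.05, patch_size=4):
    """Sort points along the assumed Z-order curve and group them into patches.

    Returns the full serialization order and an array of shape
    (num_patches, patch_size) holding point indices. Leftover points that do
    not fill a whole patch are dropped here, which is a simplification.
    """
    points = np.asarray(points, dtype=np.float64)
    # Quantize coordinates to a voxel grid anchored at the minimum corner.
    grid = np.floor((points - points.min(axis=0)) / grid_size).astype(np.int64)
    order = np.argsort(morton_code_3d(grid), kind="stable")
    n_full = (len(order) // patch_size) * patch_size
    patches = order[:n_full].reshape(-1, patch_size)
    return order, patches


# Usage: 10 random points in the unit cube, grouped into patches of 4.
rng = np.random.default_rng(0)
points = rng.random((10, 3))
order, patches = serialize_into_patches(points, grid_size=0.05, patch_size=4)
```

Each row of `patches` plays the role of a neighborhood. The points in a row are close along the curve but are not guaranteed to be the exact nearest neighbors in space, which is the loss of precision that the abstract accepts in exchange for efficiency.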
